Intro

Data Privacy Handbook

Last updated: 04 May 2023

The Data Privacy Handbook is a practical guide on handling personal data in scientific research, primarily written for Utrecht University researchers and research support staff.

It consists of:

  • A knowledge base which explains how the EU General Data Protection Regulation (GDPR, Dutch: Algemene Verordening Gegevensbescherming) applies to scientific research, including guidelines and good practices in carrying out GDPR-compliant scientific research;
  • An overview of privacy-enhancing techniques & tools and practical guidance on their implementation;
  • Use Cases in the form of research projects with privacy-related issues, for which a reusable solution (e.g., tool, workflow) is shared.

This is a Utrecht University (UU) community-driven, open source project. We welcome feedback and contributions of any type; please read our contributing guidelines for more information.

About

The Data Privacy Handbook is an initiative of Research Data Management Support, in collaboration with privacy and data experts at Utrecht University. It is part of a larger project, the Data Privacy Project, which aims to develop knowledge, tools, and experience on how researchers can and should deal with personal data. This project is funded by the Utrecht University Research IT Program and an NWO Digital Competence Center grant.

License and Citation

The Data Privacy Handbook is licensed under a Creative Commons Attribution 4.0 International License. You can view the license here.

Contributions

The Data Privacy Handbook is a collaborative effort, made possible by a large number of contributors (also listed in our GitHub repository):

Neha Moopen, Dorien Huijser, Jacques Flores, Mercedes Beltrán, Kasper de Bruijn, Wies Cipido, Ruud Dielen, David Gecks, Joris de Graaf, Judith de Haan, Saskia van den Hout, Frans Huigen, Artan Jacquet, Rik Janssen, Sanne Kleerebezem, Annemiek van der Kuil, Danny de Koning-van Nieuwamerongen, Pieter Sebastiaan de Lange, Frans de Liagre Böhl, Maisam Mohammadi Dadkan, Francisco Romero Pastrana, Najoua Ryane, Johanneke Siljee, Maarten Schermer, Raoul Schram, Ron Scholten, Garrett Speed, Robert Steeman, Jacqueline Tenkink-de Jong, Liliana Vargas Meleza, and Martine de Vos.

Would you like to contribute to this Handbook yourself? Please read our Contributing guidelines.

How to use this Handbook

The Data Privacy Handbook aims to make knowledge and solutions on handling personal data Findable, Accessible, Interoperable, and Reusable (FAIR) and present them in a practical format.

The Handbook need not be read like a textbook. You are invited to navigate to the topic you need based on the table of contents, or use the guide below.

Disclaimer

The content presented in the Data Privacy Handbook has been carefully curated by Research Data Management Support, in collaboration with privacy officers and data experts of Utrecht University.

The Data Privacy Handbook is a ‘living’ book that is continually being written, updated and reviewed. Its contents can therefore change, or become outdated or redundant. Hence, the information presented is provided “as is”, without guarantees of accuracy or completeness.

As scientific research may differ depending on the discipline, topic, and context, the measures needed or taken to ensure GDPR-compliance will vary across research projects. The authors can therefore not be held responsible or accountable for any negative consequences arising from interpretation and use of the content of the Data Privacy Handbook.

The Handbook is not endorsed by the Board of Utrecht University and does not constitute a mandatory directive. For the most up-to-date, official and authoritative information, please refer to the university website and intranet, to which this Handbook is a hands-on, practical supplement. Moreover, before implementing the guidance laid out in this Handbook, always seek the advice of your privacy officer or RDM Support to confirm the suitability of any proposed solution for your project.

Throughout the Data Privacy Handbook, links to external webpages may be provided for additional information or assistance. The authors of the Data Privacy Handbook are not responsible for the content of any such linked webpages, nor is the content of external webpages necessarily endorsed by Utrecht University.

Utrecht University is committed to sharing knowledge in line with the principles of open science and therefore welcomes readers from outside of the organization. However, the contents of the Data Privacy Handbook may not be in line with readers’ institutions’ policies or views. For more authoritative information, these readers should refer to resources from their own institutions.

Privacy FAQs

On this page you can find Frequently Asked Questions (FAQs) about handling personal data in research. Click a question to read its answer.

General questions

When should I be dealing with privacy in my project?
You should think about privacy:
  • as soon as you are processing personal data. Processing means anything you do with personal data, e.g., collecting, analysing, sharing, storing, etc. The definition of personal data is explained in the chapter What are personal data?.
  • during the earliest stages of your project. This principle is called “privacy by design”. It is easier and more effective to address any privacy issues at the design phase of your project rather than having to change your plans later on due to privacy concerns.

When are data truly anonymous?
You can read all about this in the chapters What are personal data? and Pseudonymisation and anonymisation.

What should I consider when handling personal data?
It is best to conduct a privacy scan to check if you work with personal data. A privacy scan asks you to describe, determine, and act:
  • Describe your purpose (what will you use the personal data for?), your data subjects (who are they and what is your relationship with them?), which personal data you will use, what you will do with the sensitive data (e.g., collect, store, analyse, share), and whether you will share any personal data with other parties.
  • Determine which legal basis you will use (e.g., consent, public interest), what privacy risks there are in your project and how you will mitigate them, whether you need a Data Protection Impact Assessment (DPIA), and how data subjects will be able to exercise their rights.
  • Act: inform data subjects, apply organisational and technical measures to protect the data, and maintain privacy-related documentation for as long as you process personal data.

My data were collected prior to the GDPR, what rules do I need to follow?
The GDPR applies to all personal data, including data collected before the GDPR took effect (May 2018). There is therefore no difference in how personal data collected before or after the advent of the GDPR should be handled.

My data were collected outside of the EU, does the GDPR apply to them?
Yes. As long as personal data are being processed, and the data controller, data processor, or data subject reside(s) in the European Economic Area, the GDPR applies.

How sensitive are my data?
Personal data can differ in sensitivity, depending on the type of data (e.g., sensitive personal data), from whom the data were collected (e.g., healthy adults, children, patients, the elderly), and at what scale. Data classification and a Data Protection Impact Assessment are useful tools to assess how sensitive the data are.

Procedures and responsibilities

Who is responsible for correctly handling personal data?
Legally, the controller of the personal data is responsible, i.e., the person or organisation responsible for the project activities. If you are an employee at Utrecht University (UU), the UU is legally the controller. However, the UU delegates this responsibility to the appropriate employee who is actually in charge of determining why and how personal data are handled. In a research context, this is usually the researcher on the project (e.g., PhD candidate, principal investigator).

What does the procedure look like for researchers at Utrecht University?
All researchers at UU have to write a Data Management Plan. Besides that, many faculties require that a privacy scan is done and ethical approval is obtained. Preferably, the Data Management Plan and privacy scan (which sometimes has to be extended to a Data Protection Impact Assessment) are completed, and marked as positive by the relevant data steward or privacy officer, before the ethical review takes place. Once the ethical committee has accepted your application, you can start your research project.

How long will the planning process of my research take?
This differs per faculty, but you should count on at least 1 month, if not more, to complete all planning activities. In terms of administrative work, you need to reserve time for:
  • writing a Data Management Plan and having it reviewed (a few days)
  • filling out the privacy scan and consulting with the privacy officer (a few days). If a DPIA needs to be conducted, this will take more time because the Data Protection Officer also needs to be consulted.
  • creating information for data subjects and potentially a consent form.
  • going through ethical review: it can take up to 1 month before a first decision is taken by some faculty review boards, or longer for the Medical-Ethical Review Board.
  • in some projects, setting up an agreement.
In general, designing your research with correct processing of personal data in mind will cost you less effort in the long run: start as early as possible!

Doesn’t the ethical committee also look at privacy?
Partly, although this differs per UU faculty. In most faculties, there is a collaboration between privacy and ethics. For example, at the Faculties of Social and Behavioural Sciences, Humanities, and Geosciences, privacy is included in the ethical application, but the privacy aspect of it is outsourced to the faculty privacy officer. For you as a researcher, it is wise to first complete a draft privacy scan and consult with the faculty privacy officer, and only then submit the ethical application, so that you have already thought about the privacy aspect before the ethical review process starts.

Storing personal data

Where should I store physical personal data?
Physical personal data should be stored in a locked area that only a select group of people has access to. The exact location will depend on the type of data (e.g., consent forms, filled out questionnaires, biomedical samples, etc.), and where you work. If possible, we recommend digitising and then destroying any paper materials in order to have the data in a secure and backed-up location.

Where to store participants’ contact information?
Similarly to informed consent forms, you should store contact information in a different location from the research data, and keep it well protected (strict access control, encryption, etc.). For example, store the research data on Yoda, and the contact information in a controlled OneDrive or Research Drive folder. Delete the contact information when you no longer need it (e.g., after the research project has ended).

Sharing, publishing and reusing personal data

Can I publish personal data?
This is not only a privacy issue, but also an ethical one. You can in principle ask for consent to publish personal data (either publicly or under restricted access), or in some cases rely on public interest to do so. Because the data will remain protected by the GDPR, anyone (re)using the data will have to abide by the GDPR as well (the requirements travel with the personal data). However, even if you have a legal basis to publish personal data, it may not always be ethical to do so. For that reason, we recommend always obtaining ethical approval, including when you want to publish personal data. You can read more about sharing and publishing personal data for reuse in the Sharing data for reuse chapter.

How can I share personal data with collaborators?
If the collaborator resides outside of your institute, but within the European Economic Area (EEA) or an “adequate” country, it is possible to share personal data with them, provided that data subjects are informed, there is a (joint controllers) agreement with them, and other safeguards are in place (e.g., pseudonymisation). Please contact your privacy officer if the collaborator is located outside the EEA in a country without an adequate level of data protection.

How can I share data with a third party outside of the EEA?
Personal data can be shared outside of the EEA if one of the following applies:
  • Participants have given their explicit consent after having been well informed of the risks.
  • The transfer is necessary for important reasons of public interest.
  • The data are transferred to a non-EEA country that has been deemed adequate by the European Commission.

The above apply only to “occasional” transfers. For frequent transfers, Standard Contractual Clauses should be drafted, although this requires a greater commitment from the third parties, and may require more in-depth legal assistance to establish.

What should I do if some participants do not consent to sharing their data?
This depends on the identifiability of the data and the legal basis: if it is still possible to identify individuals, then data subjects can withdraw their consent, and you won’t be able to share their data for reuse. However, if the data are altered in such a way that you can no longer identify individuals within the dataset, then you can share the data for reuse. Note that it is not always necessary to ask people for their consent to data reuse for scientific purposes; consult your privacy officer. You can read more about this in the Sharing data for reuse chapter.

Can I reuse medical data for research purposes?
You likely can. The GDPR has a derogation that specifies that secondary use for research is “not incompatible with the initial purposes” (art. 5(1)(b)), meaning that it is allowed to reuse data for research, provided that you protect the data sufficiently. As with any research project, we recommend conducting a privacy scan to assess the legality of your project, and obtaining ethical approval to assess its ethical aspects.

Can I use personal data that are already published by other researchers?
You generally can, depending on the license or terms of use that the dataset has, and assuming that the researcher who published the data had a legal basis to do so. In general, it is possible to reuse personal data for scientific research, as long as appropriate safeguards are in place (art. 89).

Can I reuse contact details for a new study?
This depends on how data subjects were informed about potential reuse of their contact details: can they expect to be contacted again and for this purpose? Note that you should have obtained access to the contact details legitimately too: are you supposed to have access to their contact details in the first place? If you are uncertain about this, ask your privacy officer for help.

Practical questions

I am using hardware to collect personal data. What should I take into account?
There are many security aspects to consider when using hardware (e.g., tablets, cameras, phones, etc.), such as whether and where any personal data is recorded and whether the device is approved by the university, see this link for more information. Make sure that you transfer the data to secure storage as soon as possible and consider measures (such as encryption) that ensure that data are protected if the hardware is lost or stolen. When you use video recording hardware, be mindful of what is recorded, also in the background. For example, be aware when filming around open laptops, documents or vulnerable people.

I want to combine data from multiple sources. How can I do so securely?
There are multiple factors to consider, depending on the type of research, the ownership of the data, the parties involved, etc. As a rule of thumb, practice data minimisation: only keep the fields or variables you need. Be mindful of data ownership: if someone else owns the data, keep that dataset separate. For more information and tailored advice, contact RDM Support.

How to generate suitable pseudonyms?
A pseudonym can be a random number, a cryptographic hash, a text string, etc. It is important that the pseudonym is not meaningful with respect to the data subjects: a random (unique) number or string is better than a code that contains parts of personal information, because the latter may reveal details about data subjects.
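A minimal sketch of generating such random, non-meaningful pseudonyms, assuming Python and the standard-library secrets module; the participant names are hypothetical:

```python
# Generate a random, non-meaningful pseudonym for each participant.
# secrets is designed for security-sensitive randomness, unlike random.
import secrets

participants = ["Alice de Vries", "Bob Jansen", "Carol Smit"]

pseudonyms = {}
for name in participants:
    code = f"SUBJ-{secrets.token_hex(4)}"  # e.g., "SUBJ-9f2a6c1b"
    while code in pseudonyms.values():     # guard against (unlikely) collisions
        code = f"SUBJ-{secrets.token_hex(4)}"
    pseudonyms[name] = code

print(pseudonyms)
```

Note that the resulting name-pseudonym key is itself personal data: store it separately and securely from the research data.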

How to pseudonymise qualitative data?
Textual data are often redacted (either manually or using a tool) so that identifiable information is removed or replaced with placeholder text. There are now also tools for masking or blurring video data and for distorting audio. Note that it is sometimes not possible to anonymise or pseudonymise qualitative data, because you may lose too much valuable information, or because the data are simply too revealing (e.g., face, voice, gestures, and posture in video data, or language use in audio data). In that case, other measures like access control, safe storage, and encryption may be more suitable.
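A minimal sketch of placeholder-based text redaction, assuming Python; the names and transcript are hypothetical, and simple string replacement will miss indirect identifiers (places, employers, dates), so treat it as a starting point only:

```python
# Replace known names in a transcript with numbered placeholders.
import re

names = ["Alice de Vries", "Bob Jansen"]  # hypothetical participants
transcript = "Alice de Vries said she met Bob Jansen at the clinic."

redacted = transcript
for i, name in enumerate(names, start=1):
    redacted = re.sub(re.escape(name), f"[PERSON_{i}]", redacted)

print(redacted)  # "[PERSON_1] said she met [PERSON_2] at the clinic."
```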

I am analysing my data in a git repository to ensure reproducibility. How can I make sure I do not accidentally push the data to GitHub?
Before you put your data in your git repository, place a line in the .gitignore file that prevents git from tracking the data. This way, when you push to GitHub, the data will not be pushed alongside the other files in the repository; at most the folder name will be visible in the .gitignore file itself.
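For example, a minimal .gitignore sketch; the folder name data/ is an assumption, so use the actual path of your dataset:

```
# Keep the folder with (personal) research data out of version control
data/
```

You can verify that a file is indeed ignored by running git check-ignore -v data/participants.csv (hypothetical path) before committing.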

Please note that if the data were tracked by git before, adding a line to your .gitignore will not prevent the data from being tracked. In this case, it is best to create a new git repository where you add a .gitignore file from the start, and delete all old versions from GitHub if there were any. If you delete the data, add the line to the .gitignore file, and then re-add the dataset, the tracking history from before the .gitignore will still exist and be pushed to GitHub.

Sidenote: it is possible to force git to track ignored files (e.g., with git add -f). This will likely not happen accidentally, but it is important to realise that the .gitignore file is not ironclad. You can read more about .gitignore here.

How to securely send participant data to participants?
In the same protected way as you would send personal data to fellow researchers. Researchers at Utrecht University can for example use SURF filesender with encryption or share a OneDrive or Research Drive file. Be sure not to share any data from other participants or other researchers!

How to work responsibly with social media data?
See these guidelines (in Dutch) about working with social media data. Every social media platform has different terms and conditions. Read these to see what you are, and are not, allowed to do with the data published on the platform you wish to research.

Where can I find relevant or approved tools?
Researchers at Utrecht University can find tools via https://tools.uu.nl and the intranet. We also curated an overview of several tools to handle personal data in this GitHub repository.

Where can I find privacy-related templates and examples for research?
Please refer to the Documents and agreements chapter or the RDM website. For others, please contact your privacy officer and/or your Ethical Review Board.

Students and student data

Can I reuse educational data (e.g., grades, course evaluations) for my research?
It is possible, but you would have to document in a privacy scan why this further processing for scientific purposes is compliant with the GDPR. Please refer to the use case about this topic for an example.

Can I share my research data containing personal data with my students?
Preferably not. Especially in a classroom setting, students should work on anonymised data as much as possible. For thesis students, share only the personal data that are strictly necessary and make sure that the students know how to handle the personal data safely. Additionally, data subjects should be informed that these students will handle their data.

Can I (re)use personal data collected by my students?
You should check what information was given to data subjects to see whether it is possible to reuse the data. In general, if data are deidentified and are going to be used for research, it is possible to make this data reuse legitimate - a privacy scan may be able to demonstrate this.

When students collect personal data, who is responsible for correct handling of those data?
The supervisor is the main person responsible, but students are also co-responsible, especially if they are taking decisions on the data themselves. Students need to comply with their respective obligations and responsibilities to ensure data is kept safe and protected.

Can a student take research data containing personal data with them to publish about them later?
It depends on why this is considered necessary, whether data subjects have been informed, whether data minimisation and de-identification are applied, etc. If students take data with them, the data will probably end up being stored on a free cloud solution such as Google Drive or Dropbox. Make sure your data subjects are informed about this beforehand, and realise that obtaining consent will be more difficult. A privacy scan should document why this is compliant with the GDPR.

I am a student, where can I store my data?
If you are a student who will be collecting personal data for research, it is the responsibility of your supervisor or course coordinator to supply you with access to an approved storage solution. Please do not use a personal device or commercial cloud solutions like Dropbox or Google Drive to store research data containing personal data. Many “free” commercial solutions analyse what you store, and so your data are not safe there.

Finding support

Where can I learn more?
For Utrecht University researchers, the most relevant information and support can be found here.

Who is the Data Protection Officer (DPO)?
The Data Protection Officer (Dutch: Functionaris Gegevensbescherming, FG) oversees an organisation’s compliance with the General Data Protection Regulation (GDPR). In research, the DPO is sometimes involved in a Data Protection Impact Assessment and, in some cases, in handling possible data breaches. If you work at Utrecht University, you can read more about the DPO’s role here.

I have a potential data breach, what should I do?
If you work or study at Utrecht University, please report this as soon as possible to the Computer Emergency Response Team (CERT).

Knowledge Base

The GDPR

This chapter presents the most important definitions, principles, and rights of data subjects outlined in the GDPR, and how they apply to your research. Most of the practical advice that we provide in this Handbook is rooted in and builds on the concepts presented here.

Chapter summary

The GDPR is an EU-wide regulation that controls the processing of personal data. If you process personal data, you should:

  • Make sure you have a legal basis to process the data. In research, this is often informed consent.
  • Be transparent and fair towards data subjects.
  • Be specific in which personal data you process and for what purposes. Limit the amount of data you process to what is necessary, and only store the data for that necessary amount of time.
  • Protect the confidentiality of the data by incorporating privacy by design into your project from the start.
  • Make sure your data subjects can exercise their rights, and that they know how to do so.

What is the GDPR?

On this page: gdpr, when privacy, uavg

The General Data Protection Regulation (GDPR, Dutch: Algemene Verordening Gegevensbescherming [AVG]) is an EU-wide regulation meant to protect the privacy of individuals within a rapidly growing technological society. The GDPR facilitates the free movement of personal data within the European Economic Area (EEA). Its data processing principles are meant to ensure a fair balance between competing interests, for example the right to conduct research vs. the right to protect personal data (Articles 13 and 8 of the Charter of Fundamental Rights of the EU).

The GDPR in a nutshell

All articles and recitals of the GDPR can be found online via https://gdpr-info.eu/. The video below highlights some important aspects of the GDPR:

Click to read the English video transcript
The General Data Protection Regulation (GDPR) regulates what we can and cannot do with personal data such as a person’s name, sexual orientation, home address and health. This also applies to personal data used in research and education. The regulation consists of 88 pages. Fortunately, the basics are easy to remember in 3 steps:
  1. First, there must be a clear legal basis for processing personal data. This can include consent, a legal obligation, or public interest.
  2. Second, appropriate technical and organisational measures must be taken while processing personal data to ensure maximum privacy.
  3. Lastly, the persons whose data you have collected must always have the option of inspecting, changing, or removing their personal data.
That is the GDPR in a nutshell.

When does the GDPR apply?

The GDPR has been applicable from May 2018 onward and applies when:

  • you are processing personal data (material scope, art. 2).
  • the controller or processor of the data resides in the EEA (territorial scope, art. 3). This is independent of whether the actual processing takes place in the EEA. In some cases, the GDPR also applies when the controller or processor is not established in the EEA, but is processing data from EU citizens.

If you are collecting or using data that originated from individuals (or is related to individuals), it is very likely that the GDPR applies to your project. You can read more in the chapter What are personal data?.

Implementation

While the GDPR is a regulation for the entire EEA, each EEA country can additionally implement further restrictions and guidelines in national implementation laws. The Dutch implementation law is called “Uitvoeringswet AVG (UAVG)”. The UAVG determines, for example, that it is forbidden to process Citizen Service Numbers (BSN), unless it is for purposes determined by a law or a General Administrative Order (AMvB).

Definitions in the GDPR

On this page: glossary, sensitive data, personal data, process, controller, processor, participant, data subject, special categories, legal ground, legal basis, anonymised, pseudonymised

Below, you will find a selection of important terms in the GDPR that you should become familiar with when working with personal data (also included in the Glossary). Click a term to see the definition.

Data subject
A living individual who can be identified directly or indirectly through personal data. In a research setting, this would be the individual whose personal data is being processed (see below for the definition of processing).
Personal data

Any information related to an identified or identifiable (living) natural person. This can include identifiers (name, identification number, location data, online identifier, or a combination of identifiers) or factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of the person. Moreover, IP addresses, opinions, tweets, answers to questionnaires, etc. may also be personal data, either on their own or in combination with one another.

Of note: as soon as you collect data related to a person that is identifiable, you are processing personal data. Additionally, pseudonymised data is still considered personal data. Read more in What are personal data?.

Special categories of personal data
Any information pertaining to the data subject which reveals any of the below categories:
  • racial or ethnic origin
  • political opinions
  • religious or philosophical beliefs
  • trade union membership
  • genetic and biometric data when meant to uniquely identify someone
  • physical or mental health conditions
  • an individual’s sex life or sexual orientation
The processing of these categories of data is prohibited, unless one of the exceptions of article 9 applies. For example, an exception applies when:
  • the data subject has provided explicit consent to process these data for a specific purpose,
  • the data subject has made the data publicly available themselves,
  • processing is necessary for scientific research purposes.

Contact your privacy officer if you wish to process special categories of personal data.

Processing
Any operation performed on personal data. This includes collection, storage, organisation, alteration, analysis, transcription, sharing, publishing, deletion, etc.
Controller

The natural or legal entity that, alone or with others, determines or has an influence on why and how personal data are processed. On an organisational level, Utrecht University (UU) is the controller of personal data collected by UU researchers and will be held responsible in case of GDPR infringement. On a practical level, however, researchers (e.g., Principal Investigators) often determine why and how data are processed, and are thus fulfilling the role of controller themselves.

Note that it is possible to be a controller without having access to personal data, for example if you assign an external company to execute research for which you determined which data they should collect, among which data subjects, how, and for what purpose.

Processor
A natural or legal entity that processes personal data on behalf of the controller. For example, when using a cloud transcription service, you often need to send personal data (e.g., an audio recording) to the transcription service for the purpose of your research, which is then fulfilling the role of processor. Other examples of processors are mailhouses used to send emails to data subjects, or Trusted Third Parties who hold the keyfile to link pseudonyms to personal data. When using such a third party, you must have a data processing agreement in place.
Legal basis
Any processing of personal data should have a valid legal basis. Without it, you are not allowed to process personal data at all. The GDPR provides 6 legal bases, which are explained further in this chapter.
Anonymous data
Any data where an individual is irreversibly de-identified, both directly (e.g., through names and email addresses) and indirectly. The latter means that you cannot identify someone:
  • by combining variables or datasets (e.g., a combination of date of birth, gender and birthplace, or the combination of a dataset with its name-number key)
  • via inference, i.e., when you can deduce who the data are about (e.g., when profession is Dutch prime minister, it is clear who the data is about)
  • by singling out a single subject, such as through unique data points (e.g., someone who is 210 cm tall is relatively easy to identify)

Anonymous data are no longer personal data and thus not subject to GDPR compliance. In practice, anonymous data may be difficult to attain, and care must be taken that the data truly cannot be traced to an individual in any way. The document Opinion 05/2014 on Anonymisation Techniques explains the criteria that must be met for data to be considered anonymous.

Pseudonymous data
Personal data that cannot lead to identification without additional information, such as a key file linking pseudonyms to names. This additional information should be kept separately and securely and makes for de-identification that is reversible. Data are sometimes pseudonymised by replacing direct identifiers (e.g., names) with a participant code (e.g., number). However, this may not always suffice, as sometimes it is still possible to identify participants indirectly (e.g., through linkage, inference or singling out). Importantly, pseudonymous data are still personal data and therefore must be handled in accordance with the GDPR.
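As an illustration of keeping this additional information separate, here is a minimal sketch, assuming Python; the file name, data, and column names are hypothetical, and in practice the key file must live in a different, access-controlled location from the research data:

```python
# Pseudonymise a small dataset and write the name-pseudonym key to a
# separate file, so that de-identification stays reversible only for
# those with access to the key.
import csv
import secrets

rows = [
    {"name": "Alice de Vries", "score": 7},
    {"name": "Bob Jansen", "score": 4},
]

key = {}  # name -> pseudonym: the "additional information" under the GDPR
for row in rows:
    row["subject_id"] = key.setdefault(row["name"], f"SUBJ-{secrets.token_hex(4)}")
    del row["name"]  # the research data keep only the pseudonym

with open("keyfile.csv", "w", newline="") as f:  # store separately from the data!
    csv.writer(f).writerows(key.items())

print(rows)  # de-identified research data; re-identification requires keyfile.csv
```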

Principles in the GDPR

On this page: legal basis, legal ground, fair, transparent, purpose, goal, aim, minimise, accurate, storing, storage, safeguards, measures, responsible, responsibility

The GDPR has a number of principles at its core which dictate the (method of) data processing. Every type of processing has to comply with these principles. Understanding these principles is the first step to determining what type of personal data can be collected and how they can be processed.

The GDPR principles are explained further below the image. The Design chapter describes how to implement these principles in your research. You can also always contact your privacy officer.

Figure: overview of the GDPR principles, each visualised with an icon: Lawful, fair and transparent; Purpose limitation; Data minimisation; Accuracy; Storage limitation; Integrity and confidentiality; Accountability.

1. Lawful, fair and transparent

When working with personal data, your processing should be:

  1. Lawful
    Make sure all your processing activities (e.g., data collection, storage, analysis, sharing) have a legal basis. Ideally, you should have determined your processing purposes (e.g., research questions) in advance.
  2. Fair
    • Consider the broad effects of your processing on the rights and dignity of the data subject.
    • Give data subjects the possibility to exercise their rights.
    • Avoid deception in the communication with data subjects: processing of personal data should be in line with what they can expect.
    • The processing of personal data should not have a disproportionate negative, unlawful, discriminating or misleading effect on data subjects.

  3. Transparent
    Be transparent in the communication to your data subjects about who is processing the personal data (controllers, processors), which personal data are processed, as well as why and for how long, and how data subjects can exercise their rights. The information provided should be unambiguous, concise, easily accessible and relevant and shared with data subjects before the start of your research.

2. Purpose limitation

You can only process (i.e., collect, analyse, store, share, etc.) personal data for a specific purpose and only for as long as necessary to complete that purpose. For example, if you communicated to data subjects that you would use their personal data only to answer your specific research question, you cannot further share the personal data for new research questions, as these would be additional processing purposes. This means that you need to plan what you will do with the (collected) personal data in advance and stick to that plan in order to be GDPR-compliant.

3. Data minimisation

You can only process the personal data you need for your predefined purpose(s), and not more just because they may “come in handy later”. This principle makes sure that, for example, in the event of a data breach, the amount of data exposed is kept to a minimum.

4. Accuracy

The accuracy of personal data is integral to data protection. Inaccurate data can be a risk for data subjects, for example when they lead to a wrong treatment in a medical trial. You therefore need to take every reasonable step to remove or rectify data that is inaccurate or incomplete. Moreover, data subjects have the right to request that inaccurate or incomplete data be removed or rectified within 30 days.

5. Storage limitation

You can only store personal data for as long as is necessary to achieve your (research) purpose. Afterwards, they need to be removed. If the personal data are part of your research data (and not, for example, to simply contact data subjects), you are allowed to store (archive) them for a longer period of time, provided necessary safeguards are in place. This is an exemption that applies to data storage for scientific archiving purposes. You need to inform the data subjects on this storage duration beforehand.

If identification of the data subject is no longer needed for your (research) purposes, you do not need to keep storing the personal data just to comply with the GDPR, even if it means your data subjects cannot exercise their rights (art. 11).

6. Integrity and confidentiality

You have to process personal data securely and protect them against unauthorised processing or access, loss, or damage. To this end, you should put appropriate organisational and technical measures in place.

7. Accountability

The controller is ultimately responsible for demonstrating GDPR-compliance. As a researcher working with personal data, you are representing your institution (e.g., Utrecht University) and you should therefore be able to demonstrate that you process personal data in a compliant manner. Additionally, you should also have some knowledge of data protection so that you can implement the right measures into your research project.

Data Subjects’ Rights

On this page: rights of participants, right, withdrawing consent, delete data

The GDPR provides data subjects with several rights that give them a relatively high degree of control over their own personal data. Below, we list these rights and how you can apply them in your research:

  1. Right to be informed
    Data subjects need to be clearly informed about what you are doing with their personal data (art. 12, among others). This usually happens via an information letter. This right does not apply if your research would be seriously harmed by meeting it and you have not obtained the personal data directly from the data subjects themselves.
  2. Right of access
    Data subjects have the right to access a copy of the personal data you have on them and to know what you are doing with that personal data and why (art. 15).
  3. Right to rectification
    Data subjects have the right to correct and complement the personal data that you have of them (art. 16).
  4. Right to erasure/Right to be forgotten
    Data subjects have the right to have their personal data removed (i.e., equivalent to the right to withdraw consent, art. 17). This right does not need to be granted if:
    • the personal data are published and need to be archived for validation purposes.
    • it would seriously obstruct the research purpose(s).
    • it would hinder complying with a legal obligation or carrying out a task in the public interest.

    If the personal data have already been made public or shared, you need to take reasonable measures to inform other users of the data of the erasure request. A privacy officer can help you with this.

  5. Right to restriction of processing
    Data subjects have the right to have you restrict the processing of their personal data (art. 18), for example if their personal data are inaccurate or your processing of them is unlawful or no longer needed.
  6. Right to data portability
    Data subjects have the right to have their personal data transferred to another party in a “structured, commonly used and machine-readable format” (art. 20).
  7. Right to object
    Data subjects have the right to object to what you are doing with their personal data. This right applies when the processing is based on legitimate or public interest (art. 21). In case of objection, you have to stop your processing activities and thus delete any data you have from the particular data subject, unless you can demonstrate concrete grounds for overriding the data subject’s rights (e.g., excluding the data subject would substantially bias your results).

How can data subjects exercise their rights?

Data subjects need to be informed about their rights and who to contact in order to exercise them, including when you use a legal basis other than informed consent. In research, this is usually done via a privacy notice or information letter, which states a contact person responsible for handling questions and requests.

Incoming requests need to be coordinated with a privacy officer, so that they can be picked up in accordance with the GDPR. Additionally, at Utrecht University, data subjects can always contact the Data Protection Officer with requests or complaints.

What to do when receiving a request concerning data subjects’ rights?

You have to provide a substantive response to the data subject within 30 days, in the same way as you received the request. Depending on the complexity and number of requests, the response period may be extended by 2 months. In that case, you must inform the data subject about this extension (including the motivation) within one month. If needed, you can (and sometimes should) ask for additional information to confirm the data subject’s identity.

For granting requests about data subjects’ rights, there should be a procedure in place, in which you should at least consider:

  • how you are going to retrieve the data (e.g., using a name-number key)
  • who is responsible for granting the request and informing the data subject about it (e.g., a data manager)
  • how the request is going to be granted, for example how they will be sent securely (access, portability), removed (forgotten, object, restriction) or corrected (rectification)

For larger projects, it may be wise to put a Standard Operating Procedure (SOP) in place.

What if the data have already been anonymised?

The principles of data minimisation and storage limitation are considered more important than keeping personal data just for the sake of identification (art. 11). Therefore, when receiving a request about anonymised data, you can make it clear that you cannot retrieve the data subject’s personal data, because they have been anonymised. In this case, the data subject cannot exercise their rights anymore. If you can still retrieve the data subject’s personal data in some way (i.e., when data are pseudonymised), you are obliged to retrieve them. In order to do so, you can (and sometimes should) ask for additional information that can confirm the data subject’s identity.

What are personal data?

In order to know whether you should comply with the GDPR in your research project, the first question to answer is: do you process personal data? To answer this question, we need to know: (1) What exactly are personal data, and (2) how do you know if you are working with personal data in your research?

Definition of personal data

According to the GDPR, personal data are “any information relating to an identified or identifiable natural person” (art. 4(1)):

  • Natural person: Data by themselves (numbers, text, pictures, audio, etc.) are not inherently personal. They only become personal when they refer to or relate to a living individual. When data refer to an organisation, deceased person, or group of individuals, they are not considered personal data under the GDPR.
  • Data are personal if they relate to an individual. This means practically anything that someone is, has said or done, owns, may think, etc.
  • The person should be identified or identifiable. This is the case not only through directly identifying information, such as names and contact information, but also through indirectly identifying information, for example if you can single someone out or identify them by combining datasets (see the next page).

How to assess whether data contain personal data?

On this page: sensitive data, privacy-sensitive, personal data, when is data privacy-sensitive, identifiability, identifier

Whether your data contain personal data depends on which data you are collecting (nature) and under which circumstances (context). A date like “12 December 1980” is not personal data – it is just a date. However, that date becomes personal data if it refers to someone’s birthday.

In assessing whether data are personal, you should take into account all the means that you and others may reasonably likely use to identify your data subjects, such as the required money, time, or (future) developments in technology (rec. 26).

Data can be identifiable when:

  • They contain directly identifying information.
    For example: name, image, video recording, audio recording, patient number, IP address, email address, phone number, location data, social media data.
  • It is possible to single out an individual

    This can happen when there are unique data points or unique behavioural patterns which can only apply to one person.

    Examples:

    • You have a data subject who is 2.10 meters tall. If this is a unique value in your dataset, this distinguishes this person from others and thus can make them identifiable.
    • You have a data subject who only follows far-right accounts on Twitter. If they are the only one in your dataset who does so, this distinguishes this person from others and can make them identifiable.

  • It is possible to infer information about an individual based on information in your dataset
    For example:
    • Inferring a medical condition based on registered medications.
    • Guessing that someone lives in a certain neighbourhood based on where they go to school.
  • It is possible to link records relating to an individual.

    This can happen when combining multiple variables within your dataset (e.g., demographic information, indirect identifiers). However, it can also happen when combining your dataset with other datasets (the “Mosaic effect”). In that case, your data still contain personal data, even if the data in your own dataset are not identifiable by themselves.

    Linkage is often possible with demographic information (age, gender, country of origin, education, workplace information, etc.) and indirect identifiers (pseudonyms, device ID, etc.), for example:

    • In the year 2000, 87% of the United States population was found to be identifiable using a combination of their ZIP code, gender and date of birth. You can see for yourself on this website.
    • An agricultural company’s Uniek Bedrijfsnummer (UBN) can be used to search for the address of the company in the I&R mobile app. Often, this address is also the owner’s home address.
    • Geographical data tracking individuals are particularly sensitive because of the multiplicity of data points. This video nicely explains why.

  • De-identification is still reversible.
    This often happens when data are pseudonymised, but there is still a way to link the pseudonymised data with identifiable data, for example when a name-pseudonym key still exists.

You can assume that you are processing personal data when you collect data directly from people, even if the results of that collection are anonymous, but also when you use data that are observed or derived from people, even if those data were previously collected, made public, or used for non-research purposes.

In short, even if you cannot find out someone’s real identity (name, address), the data you process can still contain personal data under the GDPR. Besides the examples mentioned here, there are many other examples of personal data. If you need help assessing whether or not your data contain personal data, please contact your privacy officer.
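To get a first impression of the singling-out and linkage risks described above, you can count how many records share each combination of indirectly identifying variables. A minimal sketch, assuming Python with pandas; the column names and values are hypothetical:

```python
# Flag combinations of quasi-identifiers that apply to only one person:
# such records can be singled out and are potentially identifiable.
import pandas as pd

df = pd.DataFrame({
    "age":      [34, 34, 51, 51, 51],
    "gender":   ["f", "f", "m", "m", "f"],
    "postcode": ["3511", "3511", "3581", "3581", "3581"],
})

group_sizes = df.groupby(["age", "gender", "postcode"]).size()
print(group_sizes[group_sizes == 1])  # unique combination: (51, "f", "3581")
```

Records in larger groups are harder to single out; this is the intuition behind k-anonymity-style checks.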

Special types of personal data

On this page: sensitive personal data, sensitive data, special category, special categories, politics, race, ethnicity, religion, philosophy, dna, genetics, genes, fingerprint, physical condition, mental illness, sexual identity, gender identity

There are a few special types of personal data that are worth taking note of: special categories of personal data, and otherwise sensitive personal data. These types of personal data have additional requirements. If you want to process them, please contact your privacy officer first.

Special categories of personal data

The GDPR explicitly defines seven ‘special categories of personal data’. It is information that reveals:

  • racial or ethnic origin
  • political opinions
  • religious or philosophical beliefs
  • trade union membership
  • genetic or biometric data when meant to uniquely identify someone
  • physical or mental health conditions
  • sex life or sexual orientation

It is in principle prohibited to process these types of personal data, unless an exception applies (art. 9). For example, it is allowed to process these if:

  • Data subjects have provided explicit consent to process these data for a specific purpose.
  • Data subjects have made the data publicly available themselves.
  • Processing is necessary for scientific research purposes (incl. historical and statistical purposes) and it is impossible or would take an unreasonable amount of effort to obtain explicit consent (UAVG art. 24).

Even if you can make use of one of these exemptions, special categories of personal data warrant additional security measures to make sure they are protected. Always contact your privacy officer if you intend on processing these types of data.

The Dutch Code of Conduct for Health Research (p.68) specifies a number of exceptions for health researchers in which explicit consent for processing special categories of personal data may not be necessary.

Data that are otherwise sensitive

Other types of data can also be sensitive, because they can carry higher risks for the data subjects. These types of data can either not be processed at all, or only under certain circumstances. Either way, they require additional security measures. Always contact your privacy officer if you intend on using these types of data.

Examples are:

  • Financial data
  • Data about relationship problems
  • Data that can be misused for identity fraud, such as the Dutch Citizen Service Number (BSN). In principle, the BSN cannot be used in research at all.
  • Criminal or justice-related data: they can only be processed under governmental supervision or when a derogation exists in national legislation (art. 10).

Designing your project

On this page: privacy by design, start early, preparation

Research projects typically go through a number of stages: conception, proposal, planning, execution, publishing, preservation, etc. If you work with personal data, you should think about how you will protect those data throughout all those stages. To do so, the concepts of Privacy by Design and Privacy by Default (art. 25) are important:

  • Privacy by Design in research means that your project integrates personal data protection right from the beginning, all the way throughout the project, and even afterwards. It should not be an afterthought: Privacy by Design is a key feature of the project, permeating all phases of a research project.
  • Privacy by Default in research means that any questions, tools, or methods you use in your research should process as little personal data as necessary by default, and that you share the personal data only with those who really need access.

To get proper support in designing your project, it is important to contact your privacy officer early on, preferably already in the conception or design phase. The privacy officer will help you go through the different stages smoothly, and eventually save you time and effort. They can help you review and possibly adjust your plans, determine the appropriate protection measures, and determine whether you need to perform a more elaborate Data Protection Impact Assessment.

Figure: privacy in the research cycle, from Conception (hypothesis generation, ideas), to Designing the project (grant/project proposal, draft privacy scan, consulting the privacy officer, data manager, and grant officer), to Grant/project approved (privacy scan, possibly a DPIA, Data Management Plan, privacy notice and consent form, ethics review, agreements between parties), to Data acquisition (data collection and reuse) and Data processing (preprocessing, analysis, output generation), to Preservation (of data, code, and documentation) and Publication (manuscript, data, code, and documentation).

Privacy by Design strategies

On this page: safeguards, measures, technical, organisational, procedure, design, access control, minimisation, transparency, pseudonymisation, abstraction, information, accountability, rights

To incorporate the concepts of Privacy by Design and Privacy by Default into your project, the approach of privacy design strategies (Hoepman, 2022) offers a way to make the GDPR principles more concrete. Hoepman distinguishes 8 strategies that you can apply to protect the personal data in your research: minimise, separate, abstract, hide, inform, control, enforce, and demonstrate. Below, we explain what these mean and how you can apply them.

The GDPR does not prescribe which specific measures you should apply in your project, only that they should protect the personal data effectively. Which measures will be effective will depend on your specific project, the risks for data subjects, and the current progress in technology (i.e., will the data still be protected in the long run?).

Data-oriented strategies

  • Minimise
  • Separate
  • Abstract
  • Hide

Process-oriented strategies

  • Inform
  • Control
  • Enforce
  • Demonstrate

Minimise

Limit as much as possible the processing of personal data, for example by:

  • Collecting as little data as possible to reach your research purpose.
  • Collecting personal data from only as many individuals as necessary.
  • Preferably not using tools that automatically collect unnecessary personal data. If possible, prevent tools you do use from doing so (Privacy by Default). For example, the survey tool Qualtrics can automatically register location data, which can be turned off by using the “Anonymize Responses” option.
  • Removing personal data when you no longer need them. Remove them from repositories, data collection tools, sent emails, back-ups, etc. (see also the Storage chapter). Use directly identifying information only if you legitimately need it, for example to keep in touch with data subjects or to answer your research question.
  • Pseudonymising or anonymising personal data as early as possible.
  • Using portable storage media only temporarily.

Separate

Separate the processing of different types of personal data as much as possible, for example by:

  • Storing directly identifying personal data (e.g., contact information) separately from the research data. Use identification keys to link both datasets, and store these keys also separately from the research data.
  • Separating access to different types of personal data. For example, separate who has access to contact information vs. to the research data.
  • Applying secure computation techniques, where the data remain at a central location and do not have to be moved for the analysis.

Abstract

Limit as much and as early on as possible the detail in which personal data are processed, for example by:

  • Pseudonymising or anonymising the data.
  • Adding noise to the data, e.g., voice alteration in audio data.
  • Summarising the data to simply describe general trends instead of individual data points.
  • Synthesising the data, e.g., for sharing trends in the data without revealing individual data points.

Hide

Protect personal data, or make them unlinkable or unobservable, and make sure they do not become public or known. You can do so through a combination of measures, for example:

  • Using encryption, hashing or strong passwords to protect data. Consider using a password manager to avoid losing access to the data.
  • Using secure internet connections and encrypted transport protocols (such as TLS, SFTP, HTTPS, etc.). Do not connect to public WiFi on devices containing personal data.
  • Applying privacy models like differential privacy, where calibrated noise hides individuals’ contributions to the data or to query results (a minimal sketch follows after this list).
  • Only providing access to people who really need it, and only for the necessary amount of time and with the necessary authorisations (e.g., read vs. write access; only the relevant selection of personal data, etc.). Remove authorisations when access is no longer required.
  • Encrypting and regularly backing up data on portable storage media.
  • Keeping a clear desk policy: lock your screen and store paper behind lock and key when you leave your desk.
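
As an illustration of the differential privacy model mentioned above, here is a minimal sketch, assuming Python with NumPy; the data, epsilon value, and query are hypothetical, and a real deployment needs a carefully managed privacy budget:

```python
# Release a differentially private count using the Laplace mechanism.
import numpy as np

rng = np.random.default_rng()

ages = np.array([34, 41, 29, 51, 38, 45, 62, 33])  # hypothetical data
true_count = int(np.sum(ages > 40))                # how many subjects are over 40?

epsilon = 0.5     # privacy budget: smaller means more noise, more privacy
sensitivity = 1   # adding/removing one person changes a count by at most 1
noisy_count = true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

print(noisy_count)  # release only the noisy answer, never the true count
```

Repeated queries consume the privacy budget, which is why real systems track the cumulative epsilon spent.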

Inform

Inform data subjects about the processing of their personal data in a timely and adequate manner, for example by:

  • Providing information via an information letter or privacy notice on a project website.
  • Providing verbal explanation before an interview.
  • Obtaining explicit consent via an informed consent procedure.

Back to top

Control

Give data subjects adequate control over the processing of their personal data, for example by:

  • Specifying a procedure and responsible person in case data subjects want to exercise their data subject rights.
  • Providing data subjects with a contact point (e.g., email address) for questions and exercising their data subject rights.

Back to top

Enforce

Commit to processing personal data in a privacy-friendly way, and adequately enforce this, for example by:

  • Using only Utrecht University-approved tools to collect, store, analyse and share personal data.
  • Entering into agreements with third parties if they are working with UU-controlled personal data. Such agreements ensure that everyone treats the data according to UU standards.
  • Always keeping your software up-to-date and using a virus scanner on your devices.
  • Appointing someone responsible for regulating access to the data.
  • Always reporting (suspicions of) data breaches. At UU, contact the Computer Emergency Response Team.
  • If needed, drawing up a privacy and/or security policy that specifies roles, responsibilities, and best practices on how personal data are handled throughout a project.
  • Using a Trusted Third Party when linking individual data from different sources together.

Back to top

Demonstrate

Demonstrate you are processing personal data in a privacy-friendly way, for example by:

  • Registering your research project in the UU processing register (once available).
  • Performing a Privacy Scan and storing it alongside the personal data.
  • Performing a Data Protection Impact Assessment (DPIA) for projects that have a high privacy risk for the data subjects.
  • Keeping information for data subjects and (signed) informed consent forms on file. This is not needed if you can fully anonymise the data: then you should delete the (signed) consent forms as well.

Back to top

Risk Assessment

When you work with personal data, you need to make sure that you correctly collect, store, analyse, share, etc. those data to avoid harm to data subjects. To do so, it is important to gain insight into:

  • The risks involved:
    Security risks occur when data are unexpectedly less available, less correct, or there is an unintended breach of confidentiality. They need to be mitigated by implementing integrity and confidentiality into your project.

    Privacy risks exist when your use of (personal) data, either expectedly or unexpectedly, affects the interests, rights and freedoms of data subjects. These can be Data Subjects’ Rights under the GDPR, but also other fundamental rights, such as the right to equality and non-discrimination, the right to life and physical integrity, freedom of expression and information, and religious freedom. In practice, we consider it a privacy risk if your processing of personal data can result in physical, material, or non-material harm to data subjects. Privacy risks should be mitigated by implementing all data protection principles into your project.

    When the risks for data subjects are high, an in-depth risk assessment in the form of a Data Protection Impact Assessment is needed.

  • The data classification: a classification of the data (low, basic, sensitive, critical) that is based on the risks for data subjects and the damages to an institute or project when data are incorrectly handled, there is unauthorised access, or data are leaked. This classification affects the security measures you need to take (e.g., which storage solution you choose, whether you need to encrypt the data, etc.).

Based on the risks you identified and the classification of the data, you can then implement safeguards to mitigate the risks.

How to assess privacy risks?

On this page: risk, security, assessment, harm, damage, dpia, threat, secure, measure, safeguard, protect, plan, probability, likelihood, impact

Before you start your research project, it is important to consider the risks and their severity for data subjects in your project. This assessment will inform you on which (additional) safeguards to put in place to mitigate the risks.

Privacy and security risks are usually outlined in a privacy scan or Data Protection Impact Assessment, and purely security risks in a data classification. If you create an algorithm that can affect people, an “Impact Assessment Fundamental Rights and Algorithms” may be required, or combined with any of the aforementioned assessments.

Risk assessment step by step

When going through the below steps, take into account at least the following risk scenarios:

  • Data breach (unintended security risks): someone unauthorised gains (or keeps) access to personal data, or personal data are lost due to a security incident.
  • Inability for data subjects to exercise their rights: for example, data subjects have not been (well-)informed about data processing, there is no contact person to ask for data removal, or there is no procedure in place to find, correct or remove data subjects’ data.
  • Intrusion of personal space: for example, you observe data subjects in a place or at a time where they would expect privacy (e.g., dressing rooms or at home). If there is secret or excessive observation, people may feel violated and stifled.
  • Inappropriate outcomes: the outcomes of your research project may also impact data subjects, for example when they induce discrimination, inappropriate bias, or (physical or mental) health effects, or when a lack of participation denies data subjects beneficial treatment effects.
  1. Outline which and how much (personal) data you use, how, and for what purposes
    This is usually one of the first steps of a privacy scan.
  2. Is there a project with similar data, purposes, methods and techniques?

    If there are projects that are the same as or very similar to your project, you can reuse relevant work from their privacy scan, or if applicable, Data Protection Impact Assessment (DPIA). Naturally, you should adjust sections that do not apply to your own project. If you are not sure whether similar projects exist, ask your privacy officer or colleagues.

  3. List possible harm to data subjects and others
    Make an overview of the possible harm that could occur to data subjects and others if any of the risk scenarios occurs. These could be:
    • Physical harm
      Damage to someone’s physical integrity, such as when they receive the wrong medical treatment, become the victim of a violent crime, or develop mental health problems such as depression or anxiety.
    • Material harm
      Destruction of property or economic damage, such as financial loss, career disadvantages, reduced state benefits, identity theft, extortion, unjustified fines, costs for legal advice after a data breach, etc.
    • Non-material harm
      • Social disadvantage, for example damage to someone’s reputation, humiliation, social discrimination, etc.
      • Damage to privacy, for example a lack of control over their own data or the feeling of being spied on. This can happen when you collect a lot of personal data, or for a longer period of time (e.g., with surveillance, web applications).
      • Chilling effects: when someone stops or avoids doing something they otherwise would, because they fear negative consequences or feel uncomfortable.
      • Interference with rights: using personal data may violate other fundamental rights, such as the right to non-discrimination or freedom of expression.
  4. Estimate the risk level without safeguards
    After listing the possible harm, you should determine the risk level of each harm occurring. The risk level depends on:
    • the impact of the harm: what is the effect of each of the 4 scenarios above on the data subject and others (major, substantial, manageable, minor)?
    • the likelihood of the harm occurring: this depends on the circumstances of your project, such as: what and who can cause the harm to occur? How easily are mistakes made (e.g., how easily will an unauthorised person gain access)?

    It is important to first determine the risk level as if no safeguards were implemented: this will be your risk level if all those safeguards fail. The higher this initial risk, the more you should do to mitigate it (a simple illustrative sketch follows at the end of this page).

  5. Determine the safeguards you can use to mitigate the risks

    In many cases, it is possible to mitigate the risks by implementing organisational and technical measures. The higher the risks, the more and/or stricter measures should be in place to mitigate them. You can find some relevant measures in the Privacy by Design chapter, and on the example page in this chapter.

  6. Determine the residual risk after implementing safeguards
    By implementing safeguards, you are decreasing the likelihood of the risks occurring. If the risk is still unacceptably high, even after implementing safeguards, you should:
    • Modify your processing to reduce the impact of potential damages (for example, refrain from collecting specific data types), or
    • Implement more or better measures, reducing the likelihood of any harm occurring.

It will always be difficult to quantify risks. Therefore, it is largely your argumentation that provides context for how the risk level was determined. The same harm may be very unlikely to occur in one project, yet very likely in another: context matters!
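
To make steps 4 to 6 more concrete, the sketch below shows how you could record an impact × likelihood estimate in Python. It is purely illustrative: the impact labels come from step 4 above, but the likelihood labels and scoring thresholds are assumptions for this example, not an official method.

```python
# Illustrative only: a qualitative risk matrix. The likelihood labels and
# score thresholds are assumptions, not an official scoring method.
IMPACT = {"minor": 1, "manageable": 2, "substantial": 3, "major": 4}
LIKELIHOOD = {"unlikely": 1, "possible": 2, "likely": 3}

def risk_level(impact: str, likelihood: str) -> str:
    """Combine impact and likelihood into a coarse risk level."""
    score = IMPACT[impact] * LIKELIHOOD[likelihood]
    if score >= 9:
        return "high"
    if score >= 4:
        return "medium"
    return "low"

# Step 4: initial risk without safeguards.
print(risk_level("major", "likely"))    # high: safeguards needed
# Step 6: residual risk after safeguards reduce the likelihood.
print(risk_level("major", "unlikely"))  # medium: justify or mitigate further
```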

What are high-risk operations?

On this page: high-risk, large risk, dpia, assessment, mandatory

The GDPR requires a Data Protection Impact Assessment (DPIA) to be conducted when the risks in your project are high, considering “the nature, scope, context and purposes” of your project (art. 35(1)). More practically, you need to do a DPIA when two or more of the criteria from the European Data Protection Board apply to your project, or, if the processing occurs in the Netherlands, when one or more of the criteria from the Dutch Data Protection Authority (English UU translation) applies to your project.

Examples of high-risk scenarios

You systematically use automated decision making in your project (art. 35(3))

For example:

  • You use an algorithm to analyse health records and predict patients’ risk of complications.
  • You use an algorithm to analyse students’ test scores and learning patterns, to make personalised recommendations for coursework or additional resources.
  • You use an algorithm to detect fraudulent activity.

You process special categories of personal data or criminal offense data on a large scale (art. 35(3))

For example:

  • You amplify bodily materials into pluripotent stem cells, cell lines, germ cells or embryos (see the Dutch Code of Conduct for health research, 2022).
  • You analyse social media data to study political opinions and religious beliefs.
  • You investigate criminal records from all currently incarcerated individuals (note that such a project is likely subject to additional restrictions).

You publicly monitor people on a large scale (art. 35(3))

For example:

  • You use traffic data and GPS devices to monitor people’s behaviour in traffic.
  • You use CCTV footage to study public safety.

You collect a lot of personal data, or from a large group of people (EDPB, 2017)

For example:

  • You collect data on psychosocial development in twins annually for over a decade.
  • You collect genomic data to study the genetic basis of a specific disease.
  • You keep a database with contact information from thousands of people.

You use new techniques or methods for which the effects on data subjects or others are not yet known (EDPB, 2017)

For example:

  • Machine learning algorithms.
  • Internet of Things.
  • Virtual or Augmented Reality.
  • Natural Language Processing.
  • Human-computer interaction.

Your research involves groups that are vulnerable or touches a vulnerable topic (EDPB, 2017)

For example:

  • You perform video interviews with children talking about abuse.
  • You interview refugees about their home country.
  • You perform in-depth interviews with employees about their job satisfaction.
  • You perform a diary study among mentally ill patients.
  • You collect data from homosexual individuals in a country where homosexuality is forbidden or can lead to discrimination.
  • You perform research among a population with (severe) distrust towards scientific research(ers) or who have difficulty understanding your research.

There is a high chance of incidental findings in your research (Dutch Code of Conduct for health research, 2022)

For example:

  • You collect neuroimaging data from patients who likely have a brain tumour.
  • You investigate genetic data from vulnerable subjects that indicates a risk for disease.

When you suspect that you may need a DPIA, or when you are not certain whether your project needs one, please contact your privacy officer.

Data classification

On this page: BIV classificatie, CIA triad, data classification, information security, IT system

In order to determine which IT solutions are suitable for processing personal data (e.g., storage or analysis platforms), a classification of your data is needed. That data classification can then be paired to the classification given to IT solutions. Institutes will determine for which data classification certain IT solutions are suitable. For example, at Utrecht University (UU), the classification levels are: low, basic, sensitive or critical. If your data are classified as “critical”, you are not allowed to use an IT solution that is only suitable for “sensitive” data.

To classify data, you determine how important it is to keep the data Confidential, correct (Integrity), and Available. Below you can find some guidance on determining the risk level for each of these. Note that this guidance is based on the UU data classification, but your institute may adhere to a different form of the classification.

Data classification can be done for all types of data, not only personal data. Personal data would simply score “higher” on the Confidentiality aspect.

Classification levels

Confidentiality

How confidential are the data?

  • Low:
    • Anonymous data, or data that are already publicly available, from fewer than 50 people.
    • Only direct colleagues are involved.
    • No third parties and software involved.
    • No reputation loss when data are lost.
  • Basic:
    • Non-public basic personal data such as name, (email)address, etc.
    • Personal data obtained directly from data subjects.
    • Personal data from a moderate number of data subjects (more than 50, up to 200).
    • Sensitive personal data from a small number of individuals.
    • Third parties are involved but they are located inside the EEA.
  • Sensitive:
    • A data leak would lead to reputation damage to you and the university.
    • You are bound to patents or contractual agreements.
    • Sensitive personal data from a moderate number of data subjects (e.g., personality data, financial data).
    • Non-sensitive personal data from a large number of data subjects (> 10,000).
    • Personal data enriched with external resources.
    • Far-reaching process automation.
    • Non-targeted monitoring.
    • Relatively new technology.
  • Critical:
    • Any project that carries high risks for data subjects or others:
      • Highly sensitive personal data (e.g., biometric identification data, genetic data).
      • Personal data from a very large number of data subjects (> 50,000).
      • Vulnerable subjects (e.g., minors, disabled, undocumented, persecuted groups).
      • Processing happens (partly) outside of the EEA without an adequacy decision.
    • Life-threatening research.
    • There are far-reaching contractual obligations.
    • A data leak would lead to exclusion from future grants.

Integrity

How important is it that the data are correct and can only be modified by authorised individuals?

  • Low: Incorrect data would be an inconvenience and/or require some rework.
  • Basic: Incorrect data would invalidate research and/or require significant rework.
  • Sensitive: Incorrect data would invalidate multiple research projects, could cause reputational damage to you and the university, or lead to significant contractual violations.
  • Critical: Incorrect data could lead to far-reaching contractual consequences, exclusion from future grants, or life-threatening situations.

Availability

How important is it that the data are available? When would it become a problem: if the data are unavailable for an hour, a day, a week?

  • Low: Losing (access to) the data would be inconvenient and/or lead to rework.
  • Basic: Losing (access to) the data would invalidate research and/or require significant rework. Not having access to the data would cause significant delays and could incur costs up to 250,000 EUR.
  • Sensitive: Losing (access to) the data would terminate or hugely delay multiple research projects, could cause significant reputational damage to you and the university, lead to significant contractual violations or individuals not being able to access their sensitive personal data.
  • Critical: Inaccessible data could lead to far-reaching contractual consequences or cause damages in excess of 1,500,000 EUR, including exclusion from future grants or losing access to data in potentially life-threatening situations.

Please note that a classification may be lower or higher than indicated in the examples, depending on your specific context. Please contact your privacy officer to help you classify your data. You can also contact Information Security for questions about data classification and security measures.

Examples of risks and how to mitigate them

On this page: risk example, safeguards, organisational and technical measures, protection, protective, security, data breach

Below you can find a list of common privacy and security risks in research and how you can mitigate them:

Unwarranted access to personal data

Someone tries to gain access to personal data

A previous team member still has access (e.g., a copy on their personal device, a working account)

Enforce a protocol in which team members who leave need to remove all their copies of the data and are denied access to the data and shared folders (on- and offboarding). Periodically review and update all users/rights. Make someone responsible for this process.

A team member shares the data with a third party
  • Put in place a protocol or non-disclosure agreement that makes team members aware that this is not allowed, or make sure that a data transfer agreement is in place.
  • Make sure that team members do not have access to data that they do not need access to.

A password is leaked
  • Use systems that apply multifactor authentication.
  • Change your password regularly, and immediately when it is compromised, and have your team members do the same.

Back to top

Loss of personal data

A device is lost or defective (e.g., laptop, USB stick)
  • Protect the device with a password.
  • Encrypt the device or the data on it.
  • Delete unnecessary copies of the data on the device as soon as you’ve made a back-up on a more stable and secure system, such as university-managed storage facilities.
  • Enable removing data from the device remotely.

Paper data are lost
  • Avoid collecting data on paper altogether, or only collect the necessary information.
  • Store the paper data in a central and access-controlled location, scan the documents as soon as possible, store the scans on a backed-up storage medium and destroy the paper records (securely).

The dataset is deleted accidentally

Use a storage system that has back-up functionality, or if not available, make regular manual back-ups of the data.

A system error causes temporary loss of data or of access to data
  • If you are not using centrally managed IT solutions, regularly check if back-ups are being done as expected and have protocols in place on how to restore back-ups.
  • If the outage lasts a significant amount of time, discuss with your privacy officer whether you need to inform data subjects about it: they cannot exercise their rights during that time.

The organisation is hit by a ransomware attack

Enforce a security protocol that emphasises secure data practices, such as:

  • Do not download data from unknown sources.
  • Be careful when installing software, preferably only install software from the institutional software catalogue.
  • Create awareness of what phishing looks like and of the need to report phishing immediately to the Computer Emergency Response Team.

Back to top

Unintended collection of personal data

Data subjects give more, or more sensitive information about themselves than intended/needed
  • Offer data subjects the possibility to review what information they provided.
  • Offer the possibility to withdraw consent in a later stage.
  • Use a data collection protocol to prevent this from taking place.
  • Remove the unnecessary information from your dataset.

Data subjects give (sensitive) information about others
  • Use a data collection protocol to prevent this from taking place.
  • Offer data subjects the possibility to review what information they provided.
  • Remove the unnecessary information from your dataset.
  • Consider the risks for those others vs. your own research benefits: if the interests for the other people are more important, you should delete or anonymise the information.

Personal data are collected unintentionally

This can happen when a survey tool automatically collects additional data such as IP addresses. You can sometimes turn this off; otherwise, you must remove the data as soon as possible after collection.

Back to top

Risks for data subjects

Your research has a stigmatising effect on the data subjects due to incorrect, unclear or opaque selection criteria

Describe clearly how the data subjects are selected.

Due to a small sample size, data subjects are easily identifiable
  • Increase the sample size.
  • Put in place protection measures to protect the identity of the data subjects.

Data subjects put themselves in harm’s way by participating
  • Balance the interests of the data subjects vs. those of your research project and go through ethical review.
  • Collect the data in a physically safe location.
  • Put in place protection measures like anonymisation, minimisation, blurring, etc. to hide and protect the identity of the data subjects.
  • Clearly inform data subjects what their participation entails and obtain their explicit consent.
  • If applicable, inform local authorities and obtain formal permission to perform your research.

Back to top

(PART*) Techniques & Tools

Research scenarios

Pseudonymisation & Anonymisation

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, sdc, statistical disclosure control

Pseudonymisation and anonymisation are both ways to make personal data less easily linkable to individual data subjects: they are methods to de-identify personal data. Importantly, whereas anonymisation results in non-personal data that are not subject to the GDPR anymore, pseudonymised data are still personal data. It is therefore important to understand the difference between the two, and to estimate when your data are indeed fully anonymous.

Any operation that you do up until the personal data are anonymised - including the anonymisation itself - is still subject to the GDPR. So even if you can anonymise your data later, you still need to comply with the GDPR for everything you do beforehand (e.g., collecting, analysing, sharing, etc.).

In this chapter, we:

  1. Explain what pseudonymisation and anonymisation mean.
  2. Present a step-by-step workflow to de-identify personal data.
  3. List a number of techniques that you can use to de-identify personal data.

Finally, we list some resources for further reading.

What are pseudonymisation and anonymisation?

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, identifiable, sdc, statistical disclosure control, disclosure risk

Pseudonymisation

Pseudonymisation is a safeguard that reduces the linkability of your data to your data subjects (rec. 28). It means that you de-identify the data in such a way that they can no longer lead to identification without additional information (art. 4(5)). In theory, removing this additional information should lead to anonymised data.

Pseudonymisation is often interpreted as replacing direct identifiers (e.g., names) with pseudonyms, and storing the link between the identifiers and the pseudonyms in a key file, separated from the research data. While this is a good practice (it makes sure that data are not directly identifiable anymore), this interpretation of pseudonymisation does not take into account indirectly identifiable information, and thus does not necessarily fulfil the GDPR’s definition of pseudonymisation!

Pseudonymous data are still personal data and thus subject to the GDPR. This is because the de-identification is reversible: identifying data subjects is still possible, just more difficult. This means that in order to use pseudonymous data, you still need to comply with all the rules in the GDPR.

Anonymisation

Anonymisation is a de-identification process that results in data that are “rendered anonymous in such a manner that the data subject is not or no longer identifiable” (rec. 26), neither directly nor indirectly, and by no one, including you. When data are anonymised, they are no longer personal data, and thus no longer subject to the GDPR. Note, however, that everything you do before the data are anonymised (including the anonymisation itself) is subject to the GDPR!

Anonymisation is very difficult to accomplish in practice! This video nicely illustrates why.

The identifiability spectrum

The relationship between (identifiable) personal data, pseudonymous data and anonymous data should be seen as lying on a spectrum. The more de-identified the data are, the closer they are to anonymous data and the lower the risk of re-identification. The visual guide below nicely illustrates this:

If the image does not show correctly, view it online

When are data anonymous?

Your data can be considered anonymous if data subjects can only be re-identified with an unreasonable amount of effort, i.e., taking into account the costs, required time and technology, and future technological developments (rec. 26).

Basically, your data are not anonymous (i.e., they are still personal data) when any of the following characteristics of personal data applies:

  • There is directly identifiable information (e.g., name, email address, social security number, etc.).
  • Data subjects can be singled out (i.e., you can tell one data subject from another within a known group of data subjects).
  • It is possible to identify data subjects by linking records (“mosaic effect”), either within your own database or when using other data sources.
  • It is possible to identify a data subject by inferring information about them (e.g., infer a disease by the variable “medication”), either within your own database or when using other data sources.
  • It is possible to reverse the de-identification.

Whether data can be seen as anonymous strongly depends on the context of your research and how much information is available about the data subjects.

Comic about anonymous data. The left pane shows an animal saying “Don’t worry! We will only save general information about you, not anything that could identify you!”, while holding a paper that says “Brown chicken”. Next to the animal is a brown chicken giving a “thumbs up”. The right pane shows that one brown chicken in a crowd of white chickens.

When collaborating with research data centres, such as Statistics Netherlands (Centraal Bureau voor de Statistiek, CBS), output checking guidelines are often used to determine the risk of identification resulting from the analysis output of sensitive data.

Alternatives to anonymisation

Anonymisation is not the only solution. The best way to protect data subjects’ privacy is to only collect/process their personal data if necessary (minimisation). Additionally, in many cases, full anonymisation is not even possible or desirable, for example if it results in too much information loss or incorrect inferences.

If you cannot anonymise the data, there are other ways to protect them, such as pseudonymisation, access control, encryption, and the other techniques described in this part of the Handbook (e.g., secure computation, synthetic data).

Step-by-step de-identification

On this page: anonymous, pseudonymous, step-by-step, workflow, deidentification, safeguard, protection measure

Below is a step-by-step workflow that you can use to de-identify your data. Whether or not the de-identification results in a pseudonymised or an anonymised dataset is highly dependent on the characteristics of the dataset and the context in which it was obtained.

  1. Perform the de-identification in a safe storage or processing environment: remember that you are working with personal data, and as long as the data are not anonymous, they will be subject to the GDPR!

  2. Identify any potentially identifying information in your data.

  3. Assess whether you need to collect this information at all. For example:
    1. Do you really need IP addresses in your survey data?
    2. Do you really need to record audio or video?
    3. Do you really need a consent form with a name, contact information, and signature on it?
    4. Replace names with pseudonyms in filenames and within the data where possible.

  4. If you do not need directly identifying information to answer your research question, but you do need it to, for example, contact data subjects (see the sketch at the end of this page):
    1. Separate directly identifying information from the research data.
    2. Use pseudonyms to refer to individuals instead of names.
    3. Create a keyfile to link the pseudonyms to the names.
    4. Store the directly identifiable information and the keyfile in a separate location from the research data and/or in encrypted form.

  5. Consider which types of information may lead to indirect identification, such as demographic information (age, education, occupation, etc.), geolocation, specific dates, medical conditions, unique personal characteristics, open text responses, etc.

  6. De-identify the directly and indirectly identifiable data using (a selection of) the techniques described on the next page.
    1. Before you start, save a copy of the raw, untouched dataset, in case anything in the process goes wrong.
    2. Document the steps you took, for example in a programming script or README file, which always accompanies the data.
    3. Whether you can delete the raw (non-pseudonymised) version of the dataset, depends on whether it needs to be preserved for verification purposes. Specific restrictions may also apply if the Dutch Medical Research Involving Human Subjects Act (WMO) and/or Good Clinical Practice apply to your research.

  7. Treat the data according to their sensitivity. If the data are not fully anonymised, they are pseudonymous and thus still need to be handled according to the GDPR guidelines!

How de-identified is de-identified enough? You can read more about this in the chapter Statistical approaches to privacy.
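
As an illustration of steps 4 and 6, here is a minimal, hypothetical Python sketch that replaces names with random pseudonyms and stores the key file separately from the research data. The file names and the use of pandas are assumptions for this example; where each file is stored, and how it is secured, matters more than the exact code.

```python
# Hypothetical sketch: pseudonymise names and store the key file separately.
import secrets
import pandas as pd

data = pd.DataFrame({
    "name": ["Alice", "Bob"],  # directly identifying information
    "age": [34, 51],           # research data
})

# One random, meaningless pseudonym per data subject (step 4).
key = {name: f"P{secrets.token_hex(4)}" for name in data["name"].unique()}

research = data.assign(pseudonym=data["name"].map(key)).drop(columns=["name"])
research.to_csv("research_data.csv", index=False)

# Store the key file in a separate, secured location (e.g., encrypted storage).
pd.Series(key, name="pseudonym").to_csv("key_file.csv", index_label="name")
```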

De-identification techniques

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, technique, anonymisation method, privacy-preserving, privacy-enhancing, sdc, statistical disclosure control, disclosure risk

Below is a list of techniques you can apply to your data to de-identify your dataset so that it results in a pseudonymised, or possibly even anonymised dataset. Bear in mind that applying these will always result in loss of information, so ask yourself how useful your dataset will still be after de-identification.

The techniques covered below are: suppression, generalisation, replacement, top- and bottom-coding, adding noise, and permutation.

Statistical Disclosure Control (SDC)
The below de-identification methods are sometimes also referred to as methods to apply Statistical Disclosure Control (SDC). You will most likely encounter SDC when you collaborate with a research data centre such as Statistics Netherlands (Centraal Bureau voor de Statistiek, CBS).

Suppression

Suppression (sometimes called “masking”) means removing variables, (parts of) values, or entire entries that you do not need from your dataset. Examples of data that you could consider removing (a brief code sketch follows the list):

  • Name and contact information
  • (Parts of) address
  • Date, such as birthdate or participation date
  • Social security number/Burgerservicenummer (BSN). NB. In the Netherlands, you are not allowed to use BSN in research at all!
  • Medical record number
  • IP address
  • Facial features from neuroimaging data
  • Automatically generated metadata such as GPS data in an image, author in a document, etc.
  • Participants that form extreme outliers or are too unique
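
As a minimal sketch of suppression, assuming pandas and an invented input file with invented column names:

```python
# Hypothetical sketch: suppress identifiers and overly unique entries.
import pandas as pd

df = pd.read_csv("survey.csv")  # assumed input file

# Remove variables you do not need (errors="ignore" skips absent columns).
df = df.drop(columns=["name", "email", "ip_address"], errors="ignore")

# Remove entries that are extreme outliers, here on an assumed income variable.
df = df[df["income"] < df["income"].quantile(0.99)]

df.to_csv("survey_suppressed.csv", index=False)
```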

Generalisation

Generalisation (also sometimes called abstraction, binning, aggregation, or categorisation) reduces the granularity of the data so that data subjects are less easily singled out. It can be applied to both qualitative (e.g., interview notes) and quantitative data (e.g., variables in a dataset). Here are some examples (a short code sketch follows):

  • Recoding date of birth into age.
  • Categorising age into age groups.
  • Recoding rare categories as “other” or as missing values.
  • Replacing address with the name of a neighbourhood or town.
  • Generalising specific persons in text into broader categories, e.g., “mother” to “[woman]”, “Bob” to “[colleague]”.
  • Generalising specific locations into more general places, e.g., “Utrecht” to “[home town]”, or from point coordinates to larger geographical areas (e.g., polygon or linear features).
  • Coding open-ended responses into categories of responses, or as “responded” vs. “not responded”.
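
A minimal sketch of the second example (age groups), assuming pandas; the bin edges and labels are invented for illustration:

```python
# Hypothetical sketch: generalise exact ages into age groups.
import pandas as pd

df = pd.DataFrame({"age": [23, 37, 41, 68]})
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 30, 45, 60, 120],
    labels=["30 or younger", "31-45", "46-60", "over 60"],
)
df = df.drop(columns=["age"])  # keep only the generalised variable
print(df)
```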

Replacement

In this case, you replace sensitive details with non-sensitive ones, which are usually less informative, for example:

  • Replacing directly identifying information that you do need with pseudonyms. When doing this, always store the key file securely and separately from the research data (e.g., use access control, encryption). If you do not need the links with direct identifiers anymore, remove the keyfile or replace the pseudonyms with random identifiers without saving the key.
    A good pseudonym:
    • Is not meaningful with respect to the data subjects: a random (unique) number or string is better than a code that contains parts of personal information, because the latter may reveal details about data subjects.
    • Is managed securely, for example by appointing someone to be responsible for managing access to the keyfile.
    • Can be a simple number, random number, cryptographic hash function, text string, etc. (read more).
  • Replacing identifiable text with “[redacted]”. When redacting in-text, never just blank out the identifying value; always put a placeholder or pseudonym there, e.g., in [square brackets] or <seg>segments</seg>.
  • Replacing unique values with a summary statistic, e.g., the mean.
  • Rounding values, making the data less precise.
  • Replacing one or multiple variables with a hash.
    What is hashing?

    Hashing is a way of obscuring data with a string of seemingly random characters of a fixed length. It can be used to create a “hashed” pseudonym, or to replace multiple variables with one unique value. There are many hash functions, each with their own strengths. It is usually quite difficult to reverse the hashing process, unless an attacker has knowledge about the type of information that was masked through hashing (e.g., for the MD5 algorithm, there are many lookup tables that can reverse common hashes). To prevent reversal, cryptographic hashing techniques add a “salt”, i.e., a random number or string, to the hash (the result is called a “digest”). If the “salt” is kept confidential or is removed (similar to a keyfile), it is almost impossible to reverse the hashing process (a minimal sketch follows below).
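
A minimal sketch of salted hashing with Python’s standard library; the email address is invented, and in practice the salt must be stored securely or destroyed:

```python
# Hypothetical sketch: salted hashing with Python's standard library.
import hashlib
import secrets

salt = secrets.token_hex(16)  # keep this secret or destroy it, like a key file

def hash_identifier(value: str, salt: str) -> str:
    """Return a fixed-length digest that is hard to reverse without the salt."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

print(hash_identifier("jane.doe@example.org", salt))
```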

Top- and bottom-coding

Top- and bottom-coding are mostly useful for quantitative datasets that have some unique extreme values. It means that you set a maximum or minimum and recode all higher or lower values to that maximum or minimum. For example, you can top-code a variable “income” so that all incomes over €80,000 are set to €80,000. This does distort the distribution, but leaves a large part of the data intact.
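
A one-line illustration of the income example, assuming pandas and invented values:

```python
# Hypothetical sketch: top-code incomes at 80,000 with pandas.
import pandas as pd

income = pd.Series([28_000, 55_000, 81_500, 250_000])
income_topcoded = income.clip(upper=80_000)  # values above 80,000 become 80,000
print(income_topcoded.tolist())  # [28000, 55000, 80000, 80000]
```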

Adding noise

Adding noise to data obfuscates sensitive details. It is mostly applied to quantitative datasets, but can also apply to other types of data (a short sketch follows the list). For example:

  • Adding half a standard deviation to a variable.
  • Multiplying a variable by a random number.
  • Applying Differential Privacy guarantees to an algorithm.
  • Blurring (pixelating) images and videos.
  • Voice alteration in audio.
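
One hedged illustration of noise addition, here adding random noise with a standard deviation of half that of the variable (a variation on the first bullet), assuming numpy and invented values:

```python
# Hypothetical sketch: add random noise scaled to half a standard deviation.
import numpy as np

rng = np.random.default_rng()
values = np.array([4.2, 5.1, 3.9, 4.8])
noisy = values + rng.normal(0.0, 0.5 * values.std(), size=values.shape)
print(noisy)
```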

Permutation

Permutation means swapping values between data subjects, so that it becomes more difficult to link information belonging to one data subject together. This will keep the distribution and summary statistics constant, but change correlations between variables, making some statistical analyses more difficult or impossible.
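
A minimal sketch of permuting one variable across data subjects, assuming pandas and invented values:

```python
# Hypothetical sketch: permute one variable across data subjects with pandas.
import pandas as pd

df = pd.DataFrame({"subject": [1, 2, 3, 4], "salary": [30, 45, 60, 75]})
# Shuffle the column; .to_numpy() drops the index so values are not realigned.
df["salary"] = df["salary"].sample(frac=1).to_numpy()
print(df)
```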

Tools and further reading

On this page: anonymous, pseudonymous, deidentification, safeguard, protection measure, tool, resource, reading material

You can find a selection of de-identification tools in this GitHub repository.

For further reading, we compiled a reading list on this topic in our publicly accessible Zotero library.

We can recommend:

Statistical privacy

Statistical disclosure control

K-anonymity and its descendants

Differential privacy

Secure computation

On this page: data-to-code, code-to-data, tools-to-data, algorithm-to-data, cryptography, technique, tool, computing, computation, analysis, analyse, distributed analysis

When you use personal data in your research project, you likely also need to analyse those data, often using a script of sorts. In this chapter, we discuss the following scenarios for analysing personal data:

  1. “Regular” data analysis (“data-to-code”), where the data are brought to the “script” or analysis software in order to analyse them.
  2. “Code-to-data” scenario, where a script or analysis software is run on the data, without moving the data elsewhere.
  3. Federated analysis scenario, where a script or analysis software runs on multiple datasets that are in different locations, without moving those datasets elsewhere.

Additionally, we discuss relatively new cryptographic techniques that can be used in securing the analysis of personal or otherwise sensitive data.

Which scenario should I choose?

Which scenario is suitable to apply in your project depends on, among others:

  • Your dataset: does it contain personal data? How large is the dataset? Do you know the data structure and analysis method beforehand?
  • Which computing facility is most suitable:
    • Local (e.g., laptop), on campus (e.g., cluster at Geosciences), from a national trusted party (e.g., SURF), or external (e.g., Amazon, Microsoft)?
    • Located in the Netherlands, Europe or in a non-EEA country?
    • Small or large amount of computing power (CPUs/cores/threads or GPUs, memory size, disk space, etc.)?
  • Which software you need to run on the data using that computing power, e.g., R, Python, SPSS, or any other scripting language.
    • Does the software require root user access to install and/or configure?
    • Does the software require paid licenses (e.g., MATLAB)?
    • Can the software be installed in advance, or does it need to be updated during analyses (e.g., with additional packages from a repository)?
  • Whether and with whom you are collaborating on your project.

Tools and support

We have created an overview of secure computing software and services in this GitHub repository. Keep in mind that this is by no means a complete list!

If you work at Utrecht University, you can ask the Research engineering team for help with choosing a suitable computing solution. If you have already chosen a solution, but are not sure whether it is safe to use, you can contact Information Security or your privacy officer for help.

“Regular” data analysis: data-to-code

On this page: analysis, data-to-code, data-to-script, transfer, sharing

In this scenario, you transfer the data to a computing facility, and run an analysis (script) on the data.

In the most basic variant, this computing facility consists of your work computer or faculty computing cluster, where you do not transfer the data outside of your organisation for the analysis. In other cases, data need to be transferred to a computing facility outside your organisation, such as high-performance clusters from SURF, Microsoft, Amazon, etc.

When to use

If you have a relatively small dataset, the “data-to-code” scenario is the most common and flexible scenario:

  • It allows you to choose a computing facility that is best suited to your situation.
  • It allows you to interactively read, analyse, export and transport the data you want.

Disadvantages of this approach can be:

  • When transferring the data to a computing facility, often new copies of the data are created, which can make it more difficult to keep track of different versions of the data.
  • Transferring data always comes with additional risks of a data breach. Besides protection during data storage, it is therefore crucial to also protect the data during the transfer to the computing facility, and when used at the computing facility itself.
  • Transferring the data to the computing facility is not always straightforward, especially if you have a large dataset.

Implications for research

In this scenario, you need to make sure that:

  • You apply data minimisation, access control, and, if applicable, pseudonymisation and other protective measures to limit the amount of personal data that is transferred to the computing facility.
  • The data are also protected during the transfer to the computing facility (e.g., your work laptop or an external solution), for example through encryption.

Additionally, if the computing facility is provided by an external processor (e.g., SURF, Amazon):

  • A data processing agreement with the provider of the computing facility is needed. If there is none, you cannot use the computing facility to analyse personal data.
  • The computing facility should be suitable (secure enough) for the sensitivity level of your (personal) data. For example, if your data are “critical” in terms of confidentiality, the computing facility should also have that “critical” classification.

Examples

  • You use your faculty’s high performance cluster to analyse a dataset that you collected at your organisation.
  • You use the High Performance Computing platform from SURF to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and SURF is needed to make sure that your organisation remains in control of the personal data at SURF’s servers.
  • You use Amazon Web Services (AWS) to analyse a large dataset that you collected at your organisation. In this case, a data processing agreement between your organisation and AWS is even more important, because Amazon has servers that are located outside of the European Economic Area.

Code-to-data (one data provider)

On this page: code-to-data, script-to-data, algorithm-to-data, tools-to-data, SANE, digital research environment, secure research environment, virtual research environment, access control

In this scenario, an analysis is run on data without transferring the data outside of the organisation. In many cases, only the results of the analysis can be exported, and not the data.

We distinguish the following versions of this scenario:

‘Tinker’ version: interaction with the data
In the Tinker version, users can log in to the computing facility and directly interact with the data, but there may be technical limitations on the import and export of the data. Procedural limitations should be imposed through agreements with the user. This version can be implemented in multiple ways, such as:
  • Accessing and analysing locally stored data on premises. An example is analysing highly sensitive data in a dedicated room without an internet connection.
  • Accessing locally stored data through remote desktop. This usually does not impose technical limitations on what can be done with the data.
  • Virtual Research Environments (VREs) are temporary facilities where you can interactively perform computations on data in the cloud. In this case, it is sometimes possible to impose technical limitations on what can be done with the data (in which case these are called “Trusted Research Environments”). Examples of VREs are SURF Research Cloud and anDREa.
‘Blind’ version: remote execution

In the Blind version, users do not have access to the data at all, and only receive the results of an analysis, after reviews by the data owner(s) to ensure that the results do not contain sensitive details. In this case, a synthetic dataset can be provided to write and test the analysis script on, before it is run on the real dataset. This “blind” version could be run in a dedicated environment where researchers can upload their script, but can also be implemented manually, for example when a researcher sends a script by email to be run on a dataset, and receives the results back via email as well (i.e., this is possible when neither the script nor the results contain any sensitive details).

At the moment, both the Tinker and Blind versions of this code-to-data scenario are being developed as virtual research environments in the Secure ANalysis Environment project (SANE).

When to use

Reasons to use this scenario include:

  • You want to retain control over the data, e.g., to prevent any unnecessary copies from being made (data sovereignty).
  • You do not want, or are not allowed to transfer the data, because they contain personal data or intellectual property.
  • The dataset is too large to transfer.
  • In the ‘Blind’ version: You want to be sure that the analysis results do not contain any sensitive details.

Implications for research

Compared to the “data-to-code” scenario, the code-to-data approach offers more control over the data, but often requires more, sometimes manual, work, such as:

  • Checking the credentials of a user: can they be trusted? An agreement with the user may be desirable or even required. In SURF Research Cloud, credentials can be checked using SURF Research Access Management.
  • Preparing a protected computing environment that a user can use.
  • In the ‘Blind’ version:
    • Creating a synthetic dataset.
    • Reviewing the output of the script for sensitive elements. This requires the right expertise.
    • Reviewing whether the code that is run on the data is privacy-preserving. This also requires the right expertise.

It is essential to have a well-described workflow to use this scenario, to ensure confidentiality of the personal data. Additionally, dedicated personnel may make the process easier and consistent.

Examples

  • A research team needs to process a dataset containing health data to determine the number of Covid-19 patients at a certain hospital. The hospital providing this dataset does not allow transferring the dataset, but they do allow to run scripts on the dataset. To make that possible, the hospital provides a computing facility, owned by the hospital, to run scripts from research teams. In addition, for each result, the hospital staff inspects if it contains personal data, and if not, it will be passed onto the research team. Since a result like “100 patients at this hospital have had Covid-19 in 2021” does not contain personal data, it can be safely passed to the research team.
  • In the data donation approach, the software PORT can be run on data subjects’ locally stored data, and only the results of that analysis can be shared with the researcher if allowed by the data subject. Note, however, that the sensitivity of the results fully depends on the analysis that was run.

Federated analysis

On this page: federated analysis, federated learning, machine learning, distributed analysis, distributed learning, collaboration, harmonisation

Federated analysis is an extension to the code-to-data scenario, where the data of interest are owned by multiple organisations. In this scenario, the data remain with multiple data providers, and the script “travels” across those data sources, combining the results in a central location, and only sharing the results of the analysis. If necessary, there are techniques to hide intermediate results (which could also reveal sensitive information). If the script in question is a machine learning model, then this technique is called “federated (machine) learning”. You can learn more about federated analysis in this article.
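
As a hedged illustration of this idea (not of any specific tool such as vantage6 or DataSHIELD), the sketch below computes a combined mean by letting each site share only aggregates; the site names and values are invented for the example:

```python
# Illustrative sketch: federated mean, where only aggregates leave each site.
def local_aggregate(values):
    """Runs at the data provider; raw records never leave the site."""
    return sum(values), len(values)

site_data = {                      # assumed example data per provider
    "hospital_a": [4.2, 5.1],
    "hospital_b": [3.9, 4.8, 5.0],
}
aggregates = [local_aggregate(v) for v in site_data.values()]
total = sum(s for s, _ in aggregates)
count = sum(n for _, n in aggregates)
print(total / count)               # combined result, computed centrally
```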

When to use

Federated analysis is useful when there are multiple data providers who do not allow transferring their data outside of the organisation, or whose data are simply too large to share.

Implications for research

  • A prerequisite for analysing data in this way is often that the data at the different providers are similarly structured and use similar terminology (e.g., making sure that every party uses “male”, “female”, and “other” as levels for the variable Gender, instead of “girl” and “boy”, or 0 and 1).
  • Federated analysis works best for “horizontally partitioned” datasets, where different organisations have the same (types of) information, but from different people. It is not well-suited for “vertically partitioned” datasets, where the different organisations have different (types of) information on the same people and thus want to link those different data sources.
  • Setting up the infrastructure for federated analysis is challenging and can take a large amount of time (software installation, access rights, linking datasets, etc.). It is wise to first investigate whether this option is indeed the most suitable for your project.

Examples

  • A research team needs access to various datasets containing health data to determine which factors contribute to health of Covid-19 patients at various hospitals. Each dataset contains health data from patients of the hospital where they are treated. Since each dataset contains sensitive personal data, it is not desirable to store these datasets in a central location to combine them. To be able to answer the research question, one needs to access each dataset separately and combine the results of each dataset. To make this possible, each hospital provides a computing facility. The research team submits their script to each of the computing facilities, where it is run on the local dataset. After a check by each hospital’s staff that the results do not contain any sensitive details, the results of the individual computations are combined centrally into one result. In the example, the result of the calculation at each hospital is a prediction model for Covid-19 patients, and the individual models are combined to create a more reliable prediction model.
  • Several university medical centres use the Personal Health Train from Health-RI, which relies on the vantage6 software.
  • DataSHIELD is an infrastructure and a series of R packages that allows to co-analyse data hosted at different organisations. It requires harmonising the data at the different organisations and setting up the DataSHIELD infrastructure.

Cryptographic techniques

On this page: encryption, cryptography, security, collaboration, confidential computing, mpc, homomorphic encryption

Besides the scenarios described previously, there are also multiple cryptographic techniques that can be applied to protect sensitive data in the analysis phase. Here, we discuss secure multiparty computation, confidential computing, and homomorphic encryption.

Although there is some overlap in functionality and purpose between these three techniques, they are generally still considered to be distinct and can be combined to enhance security.

These cryptographic techniques are relatively new and are not available as distinct services (yet) for direct application in research. They are for now listed here for information purposes.

Secure multiparty computation

Secure multiparty computation (also referred to as “MPC”) is a set of cryptographic techniques that allows multiple parties to jointly perform analyses on distributed datasets, as if they had a shared database, and without revealing the underlying data to each other. Among those techniques are secure set intersection (securely investigating which elements multiple databases have in common), homomorphic encryption (see below), and others.

When to use

The benefits of MPC are that no raw data are shared between the parties, computations are guaranteed to perform correctly, and there is a degree of control on who receives the result of the computation (i.e., the results are not necessarily combined in a central location). MPC is therefore a good way of implementing Privacy by Design into your project when you work with personal data.

Contrary to federated analysis, MPC is suitable for linking “vertically partitioned” datasets, i.e., when different organisations have different (types of) information on the same people and thus want to link those different data sources.

Implications for research

  • The computation in MPC is truly joint: you need to have agreed on a specific analysis to be performed and on what you will reveal as the result of the computation.
  • There is no one-size-fits-all MPC solution: different use cases ask for different implementations of MPC.
  • Additional computational resources are required to generate random secrets and distribute data over the multiple parties.

Example

  • MPC was used by a medical insurance company and hospital to determine the effectiveness of a personal lifestyle app for diabetes. In this example, it was possible to calculate average medical cost for different patient groups, based on whether they used the app or not, without revealing patient information between the insurance company and the hospital.
  • You can find a simplified example on jointly calculating average income here; a minimal sketch of one underlying building block (additive secret sharing) follows below.
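
For intuition only, here is a minimal Python sketch of additive secret sharing, one building block of MPC. It is not a production protocol: real MPC implementations add secure communication, integrity checks, and much more. All names and numbers are invented for the example.

```python
# Illustrative sketch of additive secret sharing, one MPC building block.
import secrets

Q = 2**61 - 1  # public modulus; all arithmetic happens modulo Q

def share(value, n_parties=3):
    """Split a value into random-looking shares that sum to it modulo Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

incomes = [52_000, 61_000, 47_000]            # assumed private inputs
all_shares = [share(x) for x in incomes]

# Party i sums the i-th share of every input; no single share reveals anything.
partial_sums = [sum(col) % Q for col in zip(*all_shares)]
total = sum(partial_sums) % Q
print(total / len(incomes))                   # only the average is revealed
```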

You can find more information about secure multiparty computation on https://securecomputation.org/, in this report, and on the website of TNO.


Confidential computing

Confidential computing is a technique that protects data in use through a (hardware-based) Trusted Execution Environment (TEE). This environment makes sure that data within it are kept confidential (data confidentiality) and that both the data and the code running in the TEE cannot be modified or deleted (data and code integrity). The TEE uses embedded encryption keys and makes sure that the analysis stops running when malware or unauthorised access is detected. Moreover, data and code are even invisible to the operating system, cloud provider and any virtual machines.

There are many possible applications of this technique, for example:

  • You want to protect against unauthorised access during the analysis of sensitive data.
  • You want to analyse sensitive data, and it is necessary to use an untrusted cloud platform or infrastructure.
  • You want to prevent the analysis script from leaking or manipulating data.

It is important that confidential computing is used together with encryption of data at rest and in transit, with restricted access to the decryption keys. It also requires the TEE to be trustworthy (attestation), which is an active field of study. You can read more on the website of the Confidential Computing Consortium.


(Fully) homomorphic encryption

Where “regular” encryption focuses on data at rest (e.g., in storage) or data in transit (e.g., when transferring data), homomorphic encryption allows analyses to be performed on encrypted data (“data in use”). During the analysis, both the data and the computation result remain encrypted, unless they are decrypted by the decryption key owner. This technique can be applied both in confidential computing and in secure multiparty computation.

There are multiple types of homomorphic encryption: partially homomorphic, somewhat homomorphic, and fully homomorphic encryption. The latter is the most promising solution, as it allows an unlimited number of additions and multiplications to be performed on the encrypted data.

Currently, the practical use of homomorphic encryption is limited, because it can require a lot of computational resources, making it relatively slow. New implementations are, however, being developed; see this website for a list of available implementations. Another limitation is that there is no interaction with the data during the analysis, so you cannot check whether the analysis was successful. To solve this, you could use a synthetic dataset to develop and test your algorithms first.

Other techniques

Encryption

Synthetic Data

Data donation

Tools & Services

There are many tools that you can use to work with personal data, such as tools to collect, store, and analyse personal data, but also to deidentify, encrypt and synthesise datasets. In this chapter, we provide resources to identify the tool you are looking for.

In short:

Are you developing a website or application yourself that processes user data? Check out the CNIL GDPR Guide for Developers for step-by-step guidance on how to develop your software in compliance with the GDPR.

Utrecht University tool finders

On this page: storage, storing, collection, repository, data archive, sharing, tools, services

When you are using a tool that processes personal data, that tool should do so in compliance with the GDPR. If you work at Utrecht University (UU), you can use https://tools.uu.nl to find:

  • tools that are safe to use in the Tooladvisor. These include tools for data collection, file sharing, audio transcription, and more. Most of the tools listed in the Tooladvisor are safe to use because no (personal) data are processed by the tool, because data are processed on UU premises, or because of a Data Processing Agreement between UU and the supplier of the tool, in which the supplier agrees to sufficiently protect the data entered into their tool.
  • all storage facilities provided by UU.
  • a selection of possible data repositories to publish (meta)data in.

Additionally, you can find available software via this intranet page.

Tools to deidentify, synthesise and work safely with personal data

On this page: anonymisation, pseudonymisation, de-identification, synthetic data, encryption, secure computing, computation

We are creating an overview of potential privacy-related tools for deidentifying data, creating synthetic data, and analysing data in secure environments in a GitHub repository.

Please feel free to open an Issue or a Pull Request in this repository if you wish to adjust the existing content or add new content.

Requirements for a third-party tool

On this page: custom tool, provider, agreement, third-party tool, service

If your tool of choice is not listed in https://tools.uu.nl, but it does process personal data, please contact the IT servicedesk. They will help you assess whether a tool is safe to use.

If a tool is processing personal data, the following two aspects are important to consider:

1. Who is processing the personal data: arrange an agreement

When you use a third-party tool that processes personal data, the data are not under your (full) control. In this case, you must ensure the GDPR compliance of the tool provider through one of the following (art. 46):

  • A Data processing agreement - when the provider processes (e.g., stores, analyses, collects) personal data within the European Economic Area (EEA) or a country with an adequate level of data protection.
  • Standard contractual clauses (SCCs) - when personal data are processed by a supplier outside of the EEA without an adequate level of data protection. These make sure the provider will use sufficient measures to protect the personal data and enable data subjects to exercise their rights.
  • Explicit consent of data subjects who have been informed about the risks involved (in the absence of an agreement). Please contact your privacy officer if you are considering this option.

You can assume agreements are in place for the tools recommended by UU. If there is no agreement in place between UU and the tool provider, using this tool is not allowed, even if the provider is located within the EEA, has an adequate level of data protection, or has high security standards. The only exception is when data are always end-to-end encrypted, because then the tool provider cannot learn anything from the data.

2. Security level

The tool provider should employ good security practices, such as regular back-ups in distinct geographical areas (preferably in replication rather than on tape), regular integrity checks, encryption at rest, multi-factor authentication, etc. Most of these aspects will likely be covered in the agreement, and sometimes a data classification will need to be performed. Information security can help you determine all necessary security requirements.

(PART*) Storage, Sharing, Publication

Storing personal data

In research, storage of personal data is one of the most common processing activities. Assuming you have a legal basis to store personal data, you then need to:

  • Choose a storage medium that is GDPR-compliant and that provides a sufficient level of data protection;
  • Take into account procedural and legal aspects, e.g., how will you handle the data once they are stored, and for how long will you store the data?

These aspects of storing personal data are discussed in this chapter.

Chapter summary

Where should I store personal data?

Use a medium that has been approved by your institution. If you work at Utrecht University, and your preferred storage medium is not included in the Storage Finder, then please contact RDM Support or your local data manager to find an alternative solution.

How to store personal data?

  • Apply organisational and technical safeguards, e.g., restrict access, encrypt data, pseudonymise data, specify responsibilities, etc.
  • Store (personal) data preferably in a structured, commonly used, machine-readable and interoperable format: others should be able to open, understand and work with your data.

For how long should I store personal data?

  • Delete or fully anonymise personal data when they are no longer necessary, and preferably determine when you will do this in advance.
  • In research, you can archive personal data that are necessary for validation purposes for a longer period of time, e.g., 10 years or longer.

Where should I store personal data?

On this page: storage, location, medium, yoda, o-drive, u-drive, usb stick, google drive, onedrive, teams, surfdrive, paper, security

If you work at Utrecht University (UU), you can find a suitable storage medium for digital research data via the Storage Finder. For personal data, select Sensitive or Critical (depending on the sensitivity of your data) under question 4 about Confidentiality.

Most storage media in this overview are suitable for storing personal data, either because they are controlled by UU (e.g., U- and O-drive, Beta File System) or because UU has a Data Processing Agreement in place with the storage supplier (e.g., Microsoft Office 365, Yoda).

Is your preferred storage medium not included in the storage finder? Contact RDM Support or your local data manager to find an alternative solution.

  • Consider encrypting your data, especially when using portable devices (e.g., memory sticks, phones, dictaphones). Portable devices are also not suitable as a back-up, as they are prone to bit rot and easily lost.
  • Physical personal data (e.g., paper questionnaires, informed consent forms) should be stored securely too, e.g. in a locked room, cabinet or drawer. You should also avoid leaving unsecured copies lying around (e.g., on a desk or printer).

Do not store research data containing personal data on public cloud services, e.g., Google Drive, Dropbox, OneDrive, Box, Mega, iDrive, iCloud, NextCloud, etc. These services are not (always) GDPR-compliant and/or may not offer sufficient data security. Moreover, UU does not have any formal agreements with these services, which means they may use the data stored on their platforms for their own purposes.

How should I store personal data?

On this page: access control, accountability, interoperability, interoperable, separate, anonymise, pseudonymise, de-identify

Once you have chosen a suitable storage medium, you should act in accordance with the nature of your data as well, for example through:

  • Controlling access: make sure that only the necessary people have the right kind of access (e.g., read/write) to the personal data, and remove their access when they no longer need it (e.g., when someone leaves the research project).
  • Specifying responsibilities, e.g., who is responsible for guarding access to the data in both the short and the long term? Make people aware of the confidential nature of the data. Tell them what to do in case of a data breach.
  • Procedural arrangements, e.g. capture access conditions in agreements like the consortium agreement, data processing agreement or non-disclosure agreement.
  • Storing different types of personal data in different places, e.g., research data should be stored separately from data subjects’ contact details.
  • Applying other safeguards where appropriate, e.g., encryption, pseudonymisation or anonymisation.

See Designing a GDPR-compliant research project for more tips.

Personal data should be stored in a “structured, commonly used, machine-readable and interoperable format” (rec. 68). In practice, this means that you should consider whether your files are structured and named in a logical way, use sustainable file formats, and provide understandable metadata so that others can interpret the data. You can read more about this in the RDM guide “Storing and preserving data”.

For how long should I store personal data?

On this page: retention, storage period, duration, remove, delete

As per the GDPR, anyone processing personal data may only store them for as long as is necessary for prespecified purposes (art. 5(1)(e)). Afterwards, the personal data have to be either fully anonymised or deleted. However, there is an exemption for research data, as described below.

In research, we often see a division into different types of retention periods:

  • If the personal data underpin a scientific publication, it is usually necessary to archive some personal data for integrity and validation purposes (art. 5(1)(e)), because they are part of the research data. At UU, any research data necessary for validation should be archived for at least 10 years (UU research data policy). If this includes personal data, they too should be archived. Importantly, this still means that you need to protect the personal data, and limit the personal data stored to the amount necessary for validation (art. 89)! This also implies that you should keep the documentation about the legal basis used (e.g., consent forms) during that time, so that you can demonstrate GDPR compliance.
  • Specific retention periods may apply additionally to specific types of data. For example, in the Netherlands there are specific retention periods for medical data that range between 10 and 30 years at minimum.
  • Personal data that were used for purposes other than answering your research question (e.g. contact information) should have their own retention policy: they should be removed or anonymised after the retention period (e.g. the research project) has ended.

If identification of the data subject is no longer needed for your (research) purposes, you do not need to keep storing the personal data just to comply with the GDPR, even if it means your data subjects cannot exercise their rights (art. 11).

For all types of data in your project (incl. research data to be archived), we recommend formulating which data you will retain and for how long (for example, in your Data Management Plan), and communicating the (possibly different) retention period(s) to data subjects. If you want to change the storage term you initially set and communicated for your personal data, please contact your privacy officer.

Deleting personal data

If you no longer need personal data, you must delete them, unless they need to be archived for validation purposes. When deleting data, it is important to make sure that no visible or hidden copies are left behind and that the files cannot be recovered. The Storage Finder indicates how you can fully delete data on storage media within UU that are suitable for personal data. For your own file system, you can use software like BleachBit, BCWipe, DeleteOnClick, and Eraser to delete data.

Sharing data with collaborators

On this page: share, transfer, collaborate, consortium, outside EU, EEA, security, legal basis, transparency, transparent, third-party transfer

This chapter addresses guidelines to take into account when you want to share personal data with collaborators outside of your own institution during your research project. For guidelines to share personal data after a research project, please refer to the chapter on Data sharing for reuse.

To be able to share personal data with external collaborators, you should:

1. Make sure you have a legal basis and inform data subjects
  • Make sure data subjects are well-informed about your intentions to share the data with collaborators. In the information you provide to data subjects, include the identity of your collaborators, which data are shared with them, and why, how, and for how long. Avoid using statements that preclude sharing, such as “Your data will not be shared with anyone else”.
  • Make sure you have a legal basis to share the data, e.g., informed consent or public interest. If you use consent, make sure that data subjects are aware that they are also providing consent to share their data with your collaborators.
  • Inform data subjects in a timely manner (before you start processing their data) and proactively (directly, if possible).
2. Protect the personal data appropriately
  • Assess the risks of sharing the data and the measures you will take to mitigate those in your Data Management Plan, privacy scan, or if applicable, Data Protection Impact Assessment. This is especially important if you will share your data with collaborators outside of the European Economic Area.
  • Share only the data that the collaborator needs (data minimisation), for example by deleting unnecessary data, pseudonymising the data, and sharing only with those who need access to the data.
  • Make sure data subjects can still exercise their data subjects’ rights. For example, if a data subject withdraws their consent, not only you, but also your collaborators will have to stop processing that data subject’s personal data. It is important to make clear how you and they will do so.
3. Come to agreements with collaborators
In order to protect the personal data effectively, it is important to determine which role every collaborator has: controller or processor? And if there are multiple controllers, are they separate or joint controllers? For example, in many collaborative research projects (e.g., in consortia), there are multiple controllers that collectively determine why (e.g., research question) and how (e.g., methods) to process personal data. These parties are then joint controllers, and agreements need to be made in a joint controllers agreement.

In any collaboration in which data are shared, you need to (art. 26):
  • Come to a formal agreement on:
    • The role of each party in the research project
    • Respective responsibilities in terms of data protection, such as informing data subjects and handling requests relating to data subjects’ rights
    • Who is the main point of contact for data subjects
  • Communicate (the essence of) the agreement to data subjects.
Your privacy officer can help you draw up a valid agreement.
4. Pay special attention when sharing personal data outside the EU

If you share personal data with international collaborators (for example, in countries without an adequacy decision), you may need to take additional measures. Usually, these measures include drawing up an agreement to make sure the other party is GDPR-compliant and uses the necessary security measures (if you haven’t already done so). The exact type of agreement will depend on your specific situation: your privacy officer can help you choose and set up the right one.

The flowchart below indicates the conditions under which you can share data internationally. Note that they assume that you have taken sufficient safeguards to protect the personal data. To determine the possibilities of sharing data internationally in your project, we strongly advise you to consult with your privacy officer. In some cases a Data Transfer Impact Assessment may be required, which can take some effort.

[Flowchart: international data transfers. For transfers to countries within the European Economic Area, no additional measures are required; for transfers outside of the European Economic Area, contact your faculty privacy officer to take the proper measures.]

5. Use a secure way to share the data
  • Granting access: It is preferable to grant a user access to an existing and safe infrastructure (e.g., add someone to a Yoda group or OneDrive folder), rather than physically sending the data elsewhere. This allows you to keep the data in one place, define specific access rights (read/write), have users authenticate, and easily revoke access to the data after your collaboration has ended. It is also a good idea to take measures to prevent the data from being copied elsewhere.
  • Transferring data: When it is absolutely necessary to transfer the files to a different location, you must do so securely. Researchers at Utrecht University can use SURF Filesender with encryption.

Sharing data for reuse

On this page: publication, publish, share, transfer, open science, open data, FAIR data, reuse, reproducibility

In the context of Open Science, it is becoming more important to share data with other researchers, so that they can reproduce results and reuse the data to answer new research questions. This can be challenging when you work with personal data. Here, we list a few options for sharing personal data for reuse responsibly, and making datasets that contain personal data Findable, Accessible, Interoperable and Reusable (FAIR).

How you can make your dataset FAIR, from a privacy perspective, depends on which scenario applies to you.

If you are in doubt whether you can share personal data for reuse, please ask your privacy officer for help.

Sharing anonymised data

On this page: anonymous, publication, share, transfer, open science, open data, FAIR data

If (part of) the data are truly fully anonymous, they are not classified as personal data anymore: from a privacy perspective, you can publish this anonymised (part of the) dataset in a data repository without restrictions.

The data should indeed be fully anonymous: here you can find out how to determine whether your data are anonymous. In this preprint, you can find examples of poorly anonymised (hence still personal) datasets in the field of psychology. It can be very difficult to fully anonymise personal data, so when in doubt, we recommend treating the data as personal.
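
As a first-pass indication of one identifiability risk (singling out), the sketch below counts how many records are unique on a set of quasi-identifiers; the file and column names are hypothetical. A small minimum group size suggests that the data are not anonymous. Note that passing this check is not proof of anonymity: linkage and inference risks require separate assessment.

```python
import pandas as pd

df = pd.read_csv("survey.csv")  # hypothetical dataset
quasi_identifiers = ["age_category", "gender", "postcode_region"]

# Group records by their combination of quasi-identifiers and measure group sizes.
group_sizes = df.groupby(quasi_identifiers).size()
print("Smallest group size (k):", group_sizes.min())
print("Records unique on these fields:", (group_sizes == 1).sum())
```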

Even when data are no longer subject to the GDPR, there may still be other concerns about sharing them publicly. For example:

  • publishing the data may not be ethically responsible when the data can be used to discriminate against a group of people.
  • data may be someone else’s intellectual property.

Publish your data in a data repository and include sufficient documentation and an open license with your dataset to make your data FAIR. This way, others can find, access, understand and reuse your data.

Alternatives to sharing personal data

On this page: metadata, documentation, information, publication, share, transfer, open science, FAIR data, reproducibility

Publish metadata and documentation

Even if you cannot share/publish the data, you can still publish non-sensitive metadata and documentation surrounding your research project. This allows your dataset and documentation to be findable, citable, and in some cases even reusable (one person’s metadata is another person’s data!). In order to make the dataset FAIR, you should include a note on the access restrictions of the dataset and choose a good data repository. Knowing that your dataset exists can sometimes already be useful information, even when the data are not accessible for others. For an example, please refer to the use case about the Open Science Monitor.

Use other techniques and strategies to enable reuse

There are also more technical alternatives to transferring personal data to others:

  • Use solutions that allow others to run analyses on your data, without ever needing access to those data (remote data science, see the Secure computing chapter).
  • Create a synthetic dataset that others can use to reproduce trends or explore the data.
  • Only allow differentially private algorithms to query your dataset (see the sketch after this list).
  • Publish aggregated (anonymous) data which may still be useful for others (e.g., group-level statistics).
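
To give a flavour of the differential privacy option, below is a minimal sketch of the Laplace mechanism for a counting query (sensitivity 1). It illustrates the principle only; in practice you would use an audited library and manage a privacy budget across all queries rather than implementing the mechanism yourself.

```python
import numpy as np

rng = np.random.default_rng()

def dp_count(values, predicate, epsilon=0.5):
    """Return an epsilon-differentially private count of matching records."""
    true_count = sum(1 for v in values if predicate(v))
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)  # sensitivity of a count is 1
    return true_count + noise

ages = [23, 35, 41, 29, 52, 61, 38, 45]   # hypothetical data
print(dp_count(ages, lambda a: a >= 40))  # noisy answer, e.g. 4.7 (true count: 4)
```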

(PART*) Use Cases

Data minimisation in a survey

On this page: minimise, limit, remove, questionnaire, survey

For a course, a teacher at the faculty of Veterinary Medicine collected data on the health of pets and the pets’ owners. The initial purpose of the survey was to create simple datasets for students to learn about statistics. However, besides the course, the teacher also wanted to use the collected data for research purposes and share the data with others. In order to do so, the teacher created a new version of the survey that asked for less identifiable information and could be more easily anonymised. Additionally, the new version of the survey informed participants about the legal basis used to process their personal data.

Here, you can find the survey before and after data minimisation:


Note that the new version of the survey:

  • minimises the amount of personal data collected:
    • Student number and pet names are not asked in the new version of the survey.
    • Instead of Age, the new version asks the Age category of the owner/caretaker.
    • The survey includes questions on Weight and Height. For data publication, these are used to calculate the Body Mass Index (BMI) and are deleted after this calculation (see the sketch after this list).
  • contains information about the legal basis used to be able to use (legitimate interest) and publish (consent) the data for purposes other than education.
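
Below is a minimal pandas sketch of that minimisation step, with hypothetical column names: the raw weight and height are used to derive the BMI and are then dropped before the data are published.

```python
import pandas as pd

df = pd.DataFrame({
    "age_category": ["30-39", "40-49"],
    "weight_kg": [70.0, 82.0],
    "height_m": [1.75, 1.68],
})

# Derive the value needed for publication...
df["bmi"] = (df["weight_kg"] / df["height_m"] ** 2).round(1)

# ...and delete the more identifying raw measurements.
df = df.drop(columns=["weight_kg", "height_m"])
print(df)
```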

Data pseudonymisation

On this page: pseudonymous, de-identification, replacement, open science, reuse

YOUth (Youth of Utrecht) is a longitudinal child cohort study that collects data about the behavioural and cognitive development of children in the Utrecht area. The study follows about 4000 children and their parents in two cohorts: one from birth until around the age of six, the other from around the age of nine until adolescence. YOUth collects a wide variety of data types, ranging from questionnaires to biological samples. Because of the large amount of data collected, the sensitive nature of those data, and the vulnerability of the participants (minors), the data are considered very sensitive and should therefore be pseudonymised where possible.

General steps

YOUth is committed to sharing their data for reuse, and thus the datasets that they share should contain as little personal information as possible. For that purpose, the YOUth data manager implements a number of measures:

  • All data are pseudonymised as much as possible (see below).
  • Every dataset that is shared for reuse is first checked for identifiable information. Special category information is taken out of the datasets as much as possible, and no unnecessary information such as date of birth is shared.
  • Using the tool AnonymoUUs, participant pseudonyms are replaced with artificial pseudonyms, and all dates with a fake date, each time a new set of data is prepared for sharing. This limits the ability of external researchers to link multiple requested datasets together and thus to form a more complete picture of each participant. It also prevents singling out participants based on the day they visited the research centre.
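
To illustrate the substitution idea behind such a tool, here is a minimal, generic Python sketch that replaces pseudonyms and dates in text-based files. This is not the AnonymoUUs API (see its repository for actual usage); the mapping, date handling, and file paths are hypothetical, and the keyfile holding the mapping should be stored separately and securely.

```python
import re
from pathlib import Path

mapping = {"P001": "X7F2", "P002": "K9Q4"}       # kept in a secure, separate keyfile
date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")  # ISO dates to mask

def pseudonymise_text(text, fake_date="1900-01-01"):
    """Swap original pseudonyms for artificial ones and mask all dates."""
    for original, replacement in mapping.items():
        text = text.replace(original, replacement)
    return date_pattern.sub(fake_date, text)

# Apply the substitution to every CSV file in the dataset prepared for sharing.
for path in Path("shared_dataset").glob("**/*.csv"):
    path.write_text(pseudonymise_text(path.read_text()))
```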

Pseudonymisation per data type

Below is an overview of the data types and pseudonymisation measures taken by the YOUth data manager. Besides these pseudonymisation measures, YOUth has implemented a data request procedure which delineates the conditions under which researchers can access the data, and the steps they have to take to request access.

Questionnaire data (tabular)

Children and their parents/caretakers (and sometimes their teacher) fill out several questionnaires about, among other things, their mental and physical development, living conditions, and social environment.

Pseudonymisation measures:
  • A script removes unnecessary (special category) personal data from the shared dataset where possible, such as religion, ethnicity and open text responses.
  • If a researcher needs demographic information only to describe the sample, the data manager shares a frequency table of the requested information, for example for ethnicity and socio-economic status, instead of sharing the raw responses.
  • The AnonymoUUs tool replaces the pseudonym and date in the questionnaire data and file names.

In the future, the data manager would like to share only scale scores, instead of responses to individual questions in standardised questionnaires.

Computer tasks (tabular)
On a computer, children play various games to measure the cognitive and motor development of the child. In most games, response times, choices and scores are recorded. To pseudonymise the data, the AnonymoUUs tool replaces the pseudonym and dates in the task data and filenames, and in some cases even the name of the participant.
Logbook- and experiment book data (tabular)
Notes about data collection (data quality, task order, whether the experiment started, etc.) are made in logbooks by means of a data capturing tool. In that same tool, YOUth also collects research data about body measures (length, weight and head circumference) and intelligence (WISC and WPPSI). To pseudonymise those data, the AnonymoUUs tool replaces the pseudonym and date in the filenames and data.
Video tasks (video recording)

During two tasks (the Hand game and the Delay of gratification task), children are video- and audiotaped to be able to analyse their behaviour. Parents may also be visible in the background, as well as a research assistant.

To pseudonymise these data, the videos from both the Hand game and the Delay of gratification task will be coded/scored on the variables of interest (e.g., does the child take the candy out of the bag or not). This way, no actual video recordings need to be shared with other researchers.

Parent-child interaction (video recording)
Children and their parents are videotaped while they play with each other or discuss specific topics. Because these data are difficult to pseudonymise and could be scored/coded on many different aspects, YOUth provides a special local laboratory space to perform the desired qualitative analysis on these video data.
Magnetic Resonance Imaging (MRI) data (3D image)

MRI data of children are collected to study structural (3D image of the brain, skull, and outer layers of the head) and functional (brain activity) properties of the brain.

To pseudonymise the MRI data, structural MRI scans (DICOM) are defaced using mri_deface (v1.22), resulting in NIfTI files. Additionally, the AnonymoUUs tool replaces the pseudonym in the filenames.

Electro-encephalography (EEG) data (video and text files)
A cap is placed on the child’s head with electrodes attached to measure brain activity. The child is placed in front of a monitor and views various on-screen stimuli (incl. faces, objects, sounds, music, toys). A video is also made to check whether the child watches the screen. For the moment, the videos will not be shared with external researchers. In the EEG data itself, the AnonymoUUs tool replaces the pseudonym and date.
Eye tracking data (text files)
Children are placed in front of a screen and view various stimuli (incl. faces, objects, sounds, music, toys), with or without an assignment. Eye movements and focus points are recorded using an eye tracker. To pseudonymise these data, the AnonymoUUs tool replaces the pseudonym and date in the eye-tracking data and the filenames.
Ultrasound images (3D echos)
During the mothers’ pregnancy, 3D ultrasound images are made of foetuses to follow overall and brain size development. To pseudonymise these data, the ultrasound images (DICOM) are converted to the NIfTI (.nii) format, which does not contain identifying header information. Additionally, the AnonymoUUs tool replaces the pseudonym and date in the filenames and in the SQL database that comes with the measurement.
Biological materials
At various moments during the study, (cord) blood, hair, saliva, and buccal swabs are taken from the child and sometimes their parent(s). The samples cannot be pseudonymised, because they are physical samples. Instead, a procedure is in place to have biological samples analysed at preferred partners, without having to share the physical samples with researchers.

Publishing metadata

On this page: FAIR data, metadata, documentation, publication, reuse

In 2020, the Open Science Programme of Utrecht University sent out the first Open Science Monitor. The aim was to gain insights into the awareness, attitudes, practices, opportunities and barriers of employees of Utrecht University and Utrecht University Medical Center regarding several Open Science practices. As the dataset contained a lot of demographic information (e.g., gender, age, nationality, position, type of contract, etc.), and all of those variables combined could lead to identification, it could not be shared publicly. For this particular dataset, full anonymisation was not desirable, as that would greatly decrease its scientific value. Therefore, the Open Science Programme chose to publish only the metadata and documentation, without sharing the data, in order to protect participants’ data while still complying with the FAIR principles.

Here’s the strategy they took:

Note that in the metadata of all these publications, cross-references to the other publications are included to allow for maximum findability of the project’s outputs.

Reusing education data for research

On this page: further processing, secondary use, reuse, student data, education, legal basis, access control

A research group at the Science faculty wanted to investigate the effects of the Covid-19 pandemic on students’ motivation and study success in a specific course. To do so, they wanted to analyse:

  • Students’ evaluations of the course from both before and during the pandemic.
  • Students’ test and final grades in the course from both before and during the pandemic.

The primary researchers already had access to the data for their educational activities, and so they wanted to use the data for research purposes. They went to their faculty privacy officer to find out how they could reuse these data in a responsible way.

The following privacy issues are relevant in this use case:

  • The raw data were identifiable
    The student grades were linked to names, and both the grades and the evaluations were linked to student IDs. Moreover, the evaluations could potentially contain names of teachers and other personal information, as they consisted of partly open-ended questions. To decrease identifiability, the principal investigator and a second examiner, who already had access to the students’ data, first removed or replaced all names with pseudonyms (both names of students and teachers), and went through the open-ended questions to remove potentially directly identifiable information. Only after deidentification were the data shared with research assistants who performed the main data and content analyses.

  • Data subjects’ rights
    Most students had already finished the course, and were not informed about the use of their evaluations and grades for this research project. The researchers argued that the majority of the students could not be traced anymore to provide this information or to enable them to exercise their data subjects’ rights (art. 14(5)(b)). Moreover, in case a student did want to exercise their rights, it would prove difficult to retrieve the correct data, as the data were deidentified as soon as possible.

  • Legal basis
    Students did not provide explicit consent to process their grades and evaluations for this research project. Moreover, if they had provided consent, it could be argued that the consent was not freely given, as the primary researchers were also involved as teachers, and therefore there was a hierarchical relationship between the students and the teachers. For these reasons, consent was not a suitable legal basis in this case. Instead, the researchers relied on:
    • Public interest: processing students’ data for the course itself is a public task, namely that of providing education. This was the legal basis for the initial data collection.
    • Further processing for scientific research purposes: processing data to answer the research question can be considered as secondary use of the students’ personal data. The GDPR does not consider secondary use of personal data for scientific research purposes incompatible with the original purpose (i.e., the original purpose being to provide education and improving the course, art. 5(1)(b)). Thus, it was not necessary to rely on a new legal basis for this research project, provided that the data were protected sufficiently: The researchers made sure that the data were well-protected (i.e., minimised, pseudonymised, and access controlled, art. 89).

(PART*) Resources

Seeking help at Utrecht University

If you work at Utrecht University, there are several ways to look for further support.

Education

Research Data Management Support currently offers:

Additionally, your own faculty or department may offer workshops surrounding privacy, ethics and/or research integrity.

Online information

Besides this Handbook, you can find more information on the following websites:

In-person support

The first point of contact about privacy is the privacy officer of your faculty. Besides the privacy officer, you can also ask for help from:

Glossary

The glossary consists of frequently used jargon concerning the GDPR and research data.

A

Anonymous data
Any data where an individual is irreversibly de-identified, both directly (e.g., through names and email addresses) and indirectly. The latter means that you cannot identify someone:
  • by combining variables or datasets (e.g., a combination of date of birth, gender and birthplace, or the combination of a dataset with its name-number key)
  • via inference, i.e., when you can deduce who the data are about (e.g., when profession is Dutch prime minister, it is clear who the data is about)
  • by singling out a single subject, such as through unique data points (e.g., someone who is 210 cm tall is relatively easy to identify)

Anonymous data are no longer personal data and thus not subject to GDPR compliance. In practice, anonymous data may be difficult to attain and care must be given that the data legitimately cannot be traced to an individual in any way. The document Opinion 05/2014 on Anonymisation Techniques explains the criteria that must be met for data to be considered anonymous.

C

Controller

The natural or legal entity that, alone or with others, determines or has an influence on why and how personal data are processed. On an organisational level, Utrecht University (UU) is the controller of personal data collected by UU researchers and will be held responsible in case of GDPR infringement. On a practical level, however, researchers (e.g., Principal Investigators) often determine why and how data are processed, and are thus fulfilling the role of controller themselves.

Note that it is possible to be a controller without having access to personal data, for example if you assign an external company to execute research for which you determined which data they should collect, among which data subjects, how, and for what purpose.

D

Data subject
A living individual who can be identified directly or indirectly through personal data. In a research setting, this would be the individual whose personal data is being processed (see below for the definition of processing).

E

European Economic Area (EEA)
The member states of the European Union and Iceland, Liechtenstein, and Norway. In total, the EEA now consists of 30 countries. The aim of the EEA is to enable the “free movement of goods, people, services and capital” between countries, and this includes (personal) data (source: Eurostat).

G

General Data Protection Regulation (GDPR)
A European data protection regulation meant to protect the personal data of individuals and to facilitate the free movement of personal data within the European Economic Area (EEA). The Dutch name of the regulation is “Algemene Verordening Gegevensbescherming” (AVG).

H

Hashing
Hashing is a way of replacing one or multiple variables with a fixed-length string of seemingly random characters (the “digest”). It can be used to create a “hashed” pseudonym, or to replace multiple variables with one unique value. Although hashing cannot simply be reversed, an attacker who knows the type of information that was masked can often recover it by hashing all plausible inputs and comparing the results. To prevent this, cryptographic hashing techniques add a “salt”, i.e., a random number or string, to the input before hashing. If the salt is kept confidential or is destroyed (similar to a keyfile), it is almost impossible to reverse the hashing process.
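
As an illustration, the sketch below derives a salted pseudonym from a direct identifier using HMAC-SHA-256, a standard keyed hashing construction; the identifier, salt size, and truncation length are hypothetical choices.

```python
import hashlib
import hmac
import secrets

salt = secrets.token_bytes(32)  # store separately and securely, like a keyfile

def pseudonym(identifier: str) -> str:
    """Return a salted, hashed pseudonym for a direct identifier."""
    digest = hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]  # truncated here only for readability

print(pseudonym("jane.doe@example.com"))  # e.g. '3f9a1c0b2d4e' (depends on salt)
```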

L

Legal basis
Any processing of personal data should have a valid legal basis. Without it, you are not allowed to process personal data at all. The GDPR provides 6 legal bases: consent, public interest, legitimate interest, legal obligation, performance of a contract, and vital interest. Consent and public interest are most often used in a research context.

P

Personal data

Any information related to an identified or identifiable (living) natural person. This can include identifiers (name, identification number, location data, online identifier or a combination of identifiers) or factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of the person. Moreover, IP addresses, opinions, tweets, answers to questionnaires, etc. may also be personal data, either on their own or in combination with one another.

Of note: as soon as you collect data relating to an identifiable person, you are processing personal data. Additionally, pseudonymised data are still considered personal data. Read more in What are personal data?.

Processing
Any operation performed on personal data. This includes collection, storage, organisation, alteration, analysis, transcription, sharing, publishing, deletion, etc.
Processor
A natural or legal entity that processes personal data on behalf of the controller. For example, when using a cloud transcription service, you often need to send personal data (e.g., an audio recording) to the transcription service for the purpose of your research, which is then fulfilling the role of processor. Other examples of processors are mailhouses used to send emails to data subjects, or Trusted Third Parties who hold the keyfile to link pseudonyms to personal data. When using such a third party, you must have a data processing agreement in place.
Pseudonymous data
Personal data that cannot lead to identification without additional information, such as a key file linking pseudonyms to names. This additional information should be kept separately and securely and makes for de-identification that is reversible. Data are sometimes pseudonymised by replacing direct identifiers (e.g., names) with a participant code (e.g., number). However, this may not always suffice, as sometimes it is still possible to identify participants indirectly (e.g., through linkage, inference or singling out). Importantly, pseudonymous data are still personal data and therefore must be handled in accordance with the GDPR.

S

Special categories of personal data
Any information pertaining to the data subject which reveals any of the below categories:
  • racial or ethnic origin
  • political opinions
  • religious or philosophical beliefs
  • trade union membership
  • genetic and biometric data when meant to uniquely identify someone
  • physical or mental health conditions
  • an individual’s sex life or sexual orientation
The processing of these categories of data is prohibited, unless one of the exceptions of article 9 applies. For example, an exception applies when:
  • the data subject has provided explicit consent to process these data for a specific purpose,
  • the data subject has made the data publicly available themselves,
  • processing is necessary for scientific research purposes.

Contact your privacy officer if you wish to process special categories of personal data.

Resources

For further reading, we prepared a Zotero library with additional resources, some of which are specific to Utrecht University, others more general. Click on the image below to see the most recent version of the reference library online.